2025-05-07T20:22:35.2670720Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2678202Z Runner name: 'i-011bf0f995071f8f9'
2025-05-07T20:22:35.2679116Z Machine name: 'ip-10-0-45-1'
2025-05-07T20:22:35.2681860Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2684193Z Contents: read
2025-05-07T20:22:35.2684703Z Metadata: read
2025-05-07T20:22:35.2685203Z Packages: read
2025-05-07T20:22:35.2685693Z ##[endgroup]
2025-05-07T20:22:35.2687614Z Secret source: None
2025-05-07T20:22:35.2688307Z Prepare workflow directory
2025-05-07T20:22:35.3214122Z Prepare all required actions
2025-05-07T20:22:35.3252446Z Getting action download info
2025-05-07T20:22:35.5395909Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7604293Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.0455133Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.5700625Z Getting action download info
2025-05-07T20:22:37.6878658Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9027493Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:37.9676576Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9818514Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9831848Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9833493Z ##[endgroup]
2025-05-07T20:22:39.2137976Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2138678Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2139123Z AMI Name: unknown
2025-05-07T20:22:39.2178824Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.5873893Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.5874216Z with:
2025-05-07T20:22:44.5874437Z submodules: true
2025-05-07T20:22:44.5874676Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.5875074Z token: ***
2025-05-07T20:22:44.5875281Z ssh-strict: true
2025-05-07T20:22:44.5875500Z ssh-user: git
2025-05-07T20:22:44.5875717Z persist-credentials: true
2025-05-07T20:22:44.5875976Z clean: true
2025-05-07T20:22:44.5876211Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.5876487Z fetch-depth: 1
2025-05-07T20:22:44.5876707Z fetch-tags: false
2025-05-07T20:22:44.5876923Z show-progress: true
2025-05-07T20:22:44.5877149Z lfs: false
2025-05-07T20:22:44.5877356Z set-safe-directory: true
2025-05-07T20:22:44.5877614Z env:
2025-05-07T20:22:44.5877829Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.5878139Z BUILD_ENV: build_binary
2025-05-07T20:22:44.5878403Z BUILD_TARGET: genai
2025-05-07T20:22:44.5878639Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.5878902Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.5879156Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.5879401Z ##[endgroup]
2025-05-07T20:22:44.7031960Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.7033174Z ##[group]Getting Git version info
2025-05-07T20:22:44.7033686Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7034316Z [command]/usr/bin/git version
2025-05-07T20:22:44.7034590Z git version 2.47.1
2025-05-07T20:22:44.7057070Z ##[endgroup]
2025-05-07T20:22:44.7070918Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2c362f20-a439-4bf9-ac60-aa165daf02d7' before making global git config changes
2025-05-07T20:22:44.7071831Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.7075815Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7113141Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7116497Z ##[group]Initializing the repository
2025-05-07T20:22:44.7120666Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7162469Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.7163302Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.7164001Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.7164564Z hint:
2025-05-07T20:22:44.7164964Z hint: git config --global init.defaultBranch <name>
2025-05-07T20:22:44.7165368Z hint:
2025-05-07T20:22:44.7165704Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.7166248Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.7166660Z hint:
2025-05-07T20:22:44.7166894Z hint: git branch -m <name>
2025-05-07T20:22:44.7167388Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.7174960Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.7209182Z ##[endgroup]
2025-05-07T20:22:44.7209759Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.7212974Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.7244893Z ##[endgroup]
2025-05-07T20:22:44.7245428Z ##[group]Setting up auth
2025-05-07T20:22:44.7251139Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.7283300Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.7650524Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.7683474Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.8023126Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.8072657Z ##[endgroup]
2025-05-07T20:22:44.8073254Z ##[group]Fetching the repository
2025-05-07T20:22:44.8080693Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3845344Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3846146Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3870448Z ##[endgroup]
2025-05-07T20:22:45.3870990Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3873096Z ##[endgroup]
2025-05-07T20:22:45.3877260Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3914676Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3956655Z ##[group]Checking out the ref
2025-05-07T20:22:45.3960052Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5035812Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5036233Z
2025-05-07T20:22:45.5036531Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5037249Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5037762Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5038075Z
2025-05-07T20:22:45.5038291Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5038767Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5039035Z
2025-05-07T20:22:45.5039152Z git switch -c <new-branch-name>
2025-05-07T20:22:45.5039346Z
2025-05-07T20:22:45.5039480Z Or undo this operation with:
2025-05-07T20:22:45.5039656Z
2025-05-07T20:22:45.5039749Z git switch -
2025-05-07T20:22:45.5040399Z
2025-05-07T20:22:45.5040634Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5040957Z
2025-05-07T20:22:45.5041341Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5050812Z ##[endgroup]
2025-05-07T20:22:45.5056117Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5056856Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5103340Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5135328Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5169952Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5197026Z ##[endgroup]
2025-05-07T20:22:45.5197552Z ##[group]Fetching submodules
2025-05-07T20:22:45.5199958Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5545891Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5877003Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5879036Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5882501Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5885880Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5889501Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5893469Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5896517Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5927445Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9480646Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4368854Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8554312Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.9380797Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.2509546Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4961794Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.6374440Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.6374926Z * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.6866045Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4056862Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4057350Z * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.6853641Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.3609760Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.3610212Z * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.4599220Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.7314235Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.7315109Z * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.4193211Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.1170188Z From https://github.com/google/googletest
2025-05-07T20:22:54.1170665Z * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1570786Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.0643418Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.0644396Z * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.0729991Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.7948994Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.7949465Z * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.9057474Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.9076082Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.9409143Z Entering 'external/asmjit'
2025-05-07T20:22:55.9440619Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9473470Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9505157Z Entering 'external/cutlass'
2025-05-07T20:22:55.9536277Z Entering 'external/googletest'
2025-05-07T20:22:55.9567912Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9598984Z Entering 'external/json'
2025-05-07T20:22:55.9643933Z ##[endgroup]
2025-05-07T20:22:55.9644349Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.9650402Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.9982404Z Entering 'external/asmjit'
2025-05-07T20:22:56.0049012Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0119482Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0185088Z Entering 'external/cutlass'
2025-05-07T20:22:56.0258717Z Entering 'external/googletest'
2025-05-07T20:22:56.0327898Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0396084Z Entering 'external/json'
2025-05-07T20:22:56.0479104Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.0808975Z Entering 'external/asmjit'
2025-05-07T20:22:56.0871826Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.0874347Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0935311Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.0938426Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1001364Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.1004194Z Entering 'external/cutlass'
2025-05-07T20:22:56.1064807Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.1067780Z Entering 'external/googletest'
2025-05-07T20:22:56.1132286Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.1134996Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1195850Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.1199872Z Entering 'external/json'
2025-05-07T20:22:56.1260596Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.1346228Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.1673525Z Entering 'external/asmjit'
2025-05-07T20:22:56.1706107Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1738355Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1774510Z Entering 'external/cutlass'
2025-05-07T20:22:56.1805735Z Entering 'external/googletest'
2025-05-07T20:22:56.1836669Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1869501Z Entering 'external/json'
2025-05-07T20:22:56.1915740Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.2243926Z Entering 'external/asmjit'
2025-05-07T20:22:56.2311562Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.2311885Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.2338677Z Entering 'external/cutlass'
2025-05-07T20:22:56.2371254Z Entering 'external/googletest'
2025-05-07T20:22:56.2401774Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2434045Z Entering 'external/json'
2025-05-07T20:22:56.2477869Z ##[endgroup]
2025-05-07T20:22:56.2520152Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.2548276Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
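For local debugging, the checkout above reduces to a handful of git commands. A minimal sketch, using the merge ref and commit recorded in this log (substitute your own PR number and SHA for other runs):

# Reproduce the shallow PR-merge checkout performed by actions/checkout@v4 above.
git init FBGEMM && cd FBGEMM
git remote add origin https://github.com/pytorch/FBGEMM
git config --local gc.auto 0   # the action disables auto-GC, as logged above
git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
    origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
git checkout --progress --force refs/remotes/pull/4066/merge
# Shallow-initialize all submodules, exactly as the action does.
git -c protocol.version=2 submodule update --init --force --depth=1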
2025-05-07T20:22:56.2731375Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.2731688Z with:
2025-05-07T20:22:56.2731934Z name: fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl
2025-05-07T20:22:56.2732263Z merge-multiple: false
2025-05-07T20:22:56.2732512Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.2732768Z run-id: 14891846252
2025-05-07T20:22:56.2732976Z env:
2025-05-07T20:22:56.2733200Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.2733491Z BUILD_ENV: build_binary
2025-05-07T20:22:56.2733731Z BUILD_TARGET: genai
2025-05-07T20:22:56.2733950Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.2734185Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.2734431Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.2734665Z ##[endgroup]
2025-05-07T20:22:56.5068765Z Downloading single artifact
2025-05-07T20:22:56.6074921Z Preparing to download the following artifacts:
2025-05-07T20:22:56.6075890Z - fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl (ID: 3081363869, Size: 12542866, Expected Digest: sha256:497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4)
2025-05-07T20:22:56.6585072Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-762f8d52-3fb2-51f9-ac57-047851dc6d3c/artifacts/0a0e162a22a3d874d00e499951e68dc83c18e66afa6b49ef075dcdcd39d2276e.zip
2025-05-07T20:22:56.6586479Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.7450438Z (node:57021) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.7451393Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.9609003Z SHA256 digest of downloaded artifact is 497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4
2025-05-07T20:22:56.9609616Z Artifact download completed successfully.
2025-05-07T20:22:56.9609958Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.9614765Z Download artifact has finished successfully
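The same artifact can be fetched outside of Actions with the GitHub CLI. A hedged sketch (run ID and artifact name are taken from the log above; `gh` must be authenticated with a token that can read the repository's artifacts):

# Download the wheel artifact from run 14891846252; gh extracts it into the
# target directory. The action above already verified the archive digest
# sha256:497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4.
gh run download 14891846252 --repo pytorch/FBGEMM \
    --name fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl --dir .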
2025-05-07T20:22:56.9873536Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.9873935Z with:
2025-05-07T20:22:56.9874155Z driver-version: 570.133.07
2025-05-07T20:22:56.9874412Z env:
2025-05-07T20:22:56.9874637Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.9874939Z BUILD_ENV: build_binary
2025-05-07T20:22:56.9875191Z BUILD_TARGET: genai
2025-05-07T20:22:56.9875483Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.9875721Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.9875984Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.9876229Z ##[endgroup]
2025-05-07T20:22:56.9965838Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.9966235Z with:
2025-05-07T20:22:56.9966643Z timeout_minutes: 10
2025-05-07T20:22:56.9966873Z max_attempts: 3
2025-05-07T20:22:56.9990170Z command:
# Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
  (
    set -x
    # Needed for yum-config-manager
    sudo yum install -y yum-utils
    if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
      YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
    else
      # Amazon Linux 2
      YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
    fi
    sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
    sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
    sudo systemctl restart docker
  )
}

install_nvidia_docker2_ubuntu20() {
  (
    set -x
    # Install nvidia-driver package if not installed
    status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
    if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
      sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    fi
  )
}

pre_install_nvidia_driver_amzn2() {
  (
    # Purge any nvidia driver installed from RHEL repo
    sudo yum remove -y nvidia-driver-latest-dkms
  )
}

install_nvidia_driver_common() {
  (
    # Try to gather more information about the runner and its existing NVIDIA driver if any
    echo "Before installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true

    HAS_NVIDIA_DRIVER=0
    # Check if NVIDIA driver has already been installed
    if [ -x "$(command -v nvidia-smi)" ]; then
      set +e
      # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
      # so that the same driver version is not printed over multiple lines
      INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      NVIDIA_SMI_STATUS=$?
      if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
        echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
      elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
        # Turn off persistent mode so that the installation script can unload the kernel module
        sudo killall nvidia-persistenced || true
      else
        HAS_NVIDIA_DRIVER=1
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
      fi
      set -e
    fi

    if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
      # CAUTION: this may need to be updated in future
      if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
        sudo yum groupinstall -y "Development Tools"
        # ensure our kernel install is the same as our underlying kernel,
        # groupinstall "Development Tools" has a habit of mismatching kernel headers
        sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
        sudo modprobe backlight
      fi
      sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

      set +e
      sudo /bin/bash /tmp/nvidia_driver -s --no-drm
      NVIDIA_INSTALLATION_STATUS=$?

      RESET_GPU=0
      if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
        sudo cat /var/log/nvidia-installer.log
        # Failed to install NVIDIA driver, try to reset the GPU
        RESET_GPU=1
      elif [ -x "$(command -v nvidia-smi)" ]; then
        # Check again if nvidia-smi works even if the driver installation completes successfully
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          RESET_GPU=1
        fi
      fi

      if [ "$RESET_GPU" -eq 1 ]; then
        NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
        # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
        # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
        for PCI_ID in $NVIDIA_DEVICES; do
          DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
          echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
          # This requires sudo permission of course
          echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
          sleep 1
        done
      fi

      sudo rm -fv /tmp/nvidia_driver
      set -e
    fi
  )
}

post_install_nvidia_driver_common() {
  (
    sudo modprobe nvidia || true
    echo "After installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true
    (
      set +e
      nvidia-smi
      # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
      # the case where the driver has already crashed as it still can get the driver version
      # and some basic information like the bus ID. However, the rest of the information
      # would be missing (ERR!), for example:
      #
      # +-----------------------------------------------------------------------------+
      # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
      # |-------------------------------+----------------------+----------------------+
      # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
      # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
      # | | | MIG M. |
      # |===============================+======================+======================|
      # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
      # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
      # | | | ERR! |
      # +-------------------------------+----------------------+----------------------+
      #
      # +-----------------------------------------------------------------------------+
      # | Processes: |
      # | GPU GI CI PID Type Process name GPU Memory |
      # | ID ID Usage |
      # |=============================================================================|
      # +-----------------------------------------------------------------------------+
      #
      # This should be reported as a failure instead as it will guarantee to fail when
      # Docker tries to run with --gpus all
      #
      # So, the correct check here is to query one of the missing pieces of info like
      # GPU name, so that the command can fail accordingly
      nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
      NVIDIA_SMI_STATUS=$?
      # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
      if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
        echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
      else
        echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
        exit ${NVIDIA_SMI_STATUS}
      fi
      set -e
    )
  )
}

install_nvidia_driver_amzn2() {
  (
    set -x
    pre_install_nvidia_driver_amzn2
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

install_nvidia_driver_ubuntu20() {
  (
    set -x
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_driver_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_driver_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_docker2_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_docker2_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPU. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true

# This should show persistence mode ON
nvidia-smi
2025-05-07T20:22:57.0013026Z retry_wait_seconds: 10
2025-05-07T20:22:57.0013295Z polling_interval_seconds: 1
2025-05-07T20:22:57.0013561Z warning_on_retry: true
2025-05-07T20:22:57.0013813Z continue_on_error: false
2025-05-07T20:22:57.0014054Z env:
2025-05-07T20:22:57.0014275Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.0014590Z BUILD_ENV: build_binary
2025-05-07T20:22:57.0014840Z BUILD_TARGET: genai
2025-05-07T20:22:57.0015066Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.0015316Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.0015583Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.0015827Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.0016081Z ##[endgroup]
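The core of the script above is its GPU health check: a bare nvidia-smi can exit 0 even when the driver has crashed, so the script queries a field that turns to ERR! in that state and treats only exit codes 0 and 14 as healthy. A minimal standalone sketch of that check:

# Query the GPU name; this fails when the driver is wedged even though plain
# `nvidia-smi` would still exit 0. Status 14 is allowed, per
# https://github.com/NVIDIA/gpu-operator/issues/285.
set +e
nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
STATUS=$?
set -e
if [ "$STATUS" -ne 0 ] && [ "$STATUS" -ne 14 ]; then
  echo "ERROR: nvidia-smi exited with unresolved status ${STATUS}" >&2
  exit "$STATUS"
fi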
2025-05-07T20:22:57.0813406Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.0814509Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.0817912Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.7246631Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.7247021Z No packages marked for removal.
2025-05-07T20:22:57.7308801Z Dependencies resolved.
2025-05-07T20:22:57.7318640Z Nothing to do.
2025-05-07T20:22:57.7319113Z Complete!
2025-05-07T20:22:57.7646226Z + install_nvidia_driver_common
2025-05-07T20:22:57.7650259Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.7650581Z + lspci
2025-05-07T20:22:57.7652236Z Before installing NVIDIA driver
2025-05-07T20:22:57.7834510Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.7835253Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.7835820Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.7836356Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.7836839Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.7837377Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.7837871Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.7838340Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.7838743Z + lsmod
2025-05-07T20:22:57.7879693Z Module Size Used by
2025-05-07T20:22:57.7880017Z xt_conntrack 16384 1
2025-05-07T20:22:57.7880283Z nft_chain_nat 16384 3
2025-05-07T20:22:57.7880554Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.7880866Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.7881196Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.7881600Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.7882041Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.7882362Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.7882656Z xfrm_user 57344 1
2025-05-07T20:22:57.7882928Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.7883223Z xt_addrtype 16384 2
2025-05-07T20:22:57.7883484Z nft_compat 20480 4
2025-05-07T20:22:57.7883796Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.7884216Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.7884597Z br_netfilter 36864 0
2025-05-07T20:22:57.7884887Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.7885201Z stp 16384 1 bridge
2025-05-07T20:22:57.7885495Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.7885794Z overlay 167936 0
2025-05-07T20:22:57.7886056Z tls 135168 0
2025-05-07T20:22:57.7886316Z nls_ascii 16384 1
2025-05-07T20:22:57.7886571Z nls_cp437 20480 1
2025-05-07T20:22:57.7886828Z vfat 24576 1
2025-05-07T20:22:57.7887092Z fat 86016 1 vfat
2025-05-07T20:22:57.7887359Z sunrpc 696320 1
2025-05-07T20:22:57.7887613Z ena 180224 0
2025-05-07T20:22:57.7887870Z i8042 45056 0
2025-05-07T20:22:57.7888126Z serio 28672 3 i8042
2025-05-07T20:22:57.7888408Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.7888814Z button 24576 0
2025-05-07T20:22:57.7889077Z sch_fq_codel 20480 17
2025-05-07T20:22:57.7889335Z dm_mod 188416 0
2025-05-07T20:22:57.7889594Z fuse 163840 1
2025-05-07T20:22:57.7889855Z loop 36864 0
2025-05-07T20:22:57.7890109Z configfs 57344 1
2025-05-07T20:22:57.7890371Z dax 45056 1 dm_mod
2025-05-07T20:22:57.7890657Z dmi_sysfs 20480 0
2025-05-07T20:22:57.7890913Z crc32_pclmul 16384 0
2025-05-07T20:22:57.7891180Z crc32c_intel 24576 0
2025-05-07T20:22:57.7891443Z efivarfs 24576 1
2025-05-07T20:22:57.7891693Z + modinfo nvidia
2025-05-07T20:22:57.7898491Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.7898981Z import_ns: DMA_BUF
2025-05-07T20:22:57.7899241Z alias: char-major-195-*
2025-05-07T20:22:57.7899509Z version: 570.133.07
2025-05-07T20:22:57.7899771Z supported: external
2025-05-07T20:22:57.7900042Z license: Dual MIT/GPL
2025-05-07T20:22:57.7900335Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.7900682Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.7901449Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.7901793Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.7902224Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.7902594Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.7902914Z depends: i2c-core,drm
2025-05-07T20:22:57.7903185Z retpoline: Y
2025-05-07T20:22:57.7903404Z name: nvidia
2025-05-07T20:22:57.7903767Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.7904246Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.7904692Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.7905249Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.7905565Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.7905867Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.7906202Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.7906509Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.7906834Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.7907201Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.7907660Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.7908003Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.7908304Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.7908622Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.7908995Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.7909401Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.7909796Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.7910231Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.7910652Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.7911084Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.7911509Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.7911861Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.7912239Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.7912628Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.7912982Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.7913310Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.7913656Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.7913992Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.7914312Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.7914663Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.7915043Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.7915385Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.7915726Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.7916082Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.7916434Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.7916782Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.7917125Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.7917422Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.7917757Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.7918093Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.7918419Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.7918764Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.7919126Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.7919483Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.7919825Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.7920175Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.7920523Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.7920916Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.7921162Z ++ command -v nvidia-smi
2025-05-07T20:22:57.7921429Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.7921693Z + set +e
2025-05-07T20:22:57.7921998Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.5923985Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.5924383Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.5924641Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.5924878Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.5925154Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.5925611Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.5926094Z + set -e
2025-05-07T20:22:59.5926661Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.5927064Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.5927538Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.5930497Z + sudo modprobe nvidia
2025-05-07T20:22:59.7343945Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.7344301Z + lspci
2025-05-07T20:22:59.7344538Z After installing NVIDIA driver
2025-05-07T20:22:59.7458655Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.7459175Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.7459735Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.7460273Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.7460754Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.7461333Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.7461851Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.7462338Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.7462742Z + lsmod
2025-05-07T20:22:59.7490954Z Module Size Used by
2025-05-07T20:22:59.7491260Z nvidia_uvm 1884160 0
2025-05-07T20:22:59.7491552Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:22:59.7491861Z drm 602112 1 nvidia
2025-05-07T20:22:59.7492174Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:59.7492502Z backlight 24576 1 drm
2025-05-07T20:22:59.7492796Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:59.7493099Z xt_conntrack 16384 1
2025-05-07T20:22:59.7493362Z nft_chain_nat 16384 3
2025-05-07T20:22:59.7493634Z xt_MASQUERADE 20480 1
2025-05-07T20:22:59.7493948Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.7494296Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:59.7494695Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.7495134Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:59.7495459Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:59.7495756Z xfrm_user 57344 1
2025-05-07T20:22:59.7496035Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:59.7496334Z xt_addrtype 16384 2
2025-05-07T20:22:59.7496593Z nft_compat 20480 4
2025-05-07T20:22:59.7496914Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.7497335Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.7497716Z br_netfilter 36864 0
2025-05-07T20:22:59.7497994Z bridge 323584 1 br_netfilter
2025-05-07T20:22:59.7498298Z stp 16384 1 bridge
2025-05-07T20:22:59.7498590Z llc 16384 2 bridge,stp
2025-05-07T20:22:59.7498878Z overlay 167936 0
2025-05-07T20:22:59.7499138Z tls 135168 0
2025-05-07T20:22:59.7499395Z nls_ascii 16384 1
2025-05-07T20:22:59.7499861Z nls_cp437 20480 1
2025-05-07T20:22:59.7500128Z vfat 24576 1
2025-05-07T20:22:59.7500389Z fat 86016 1 vfat
2025-05-07T20:22:59.7500654Z sunrpc 696320 1
2025-05-07T20:22:59.7500913Z ena 180224 0
2025-05-07T20:22:59.7501240Z i8042 45056 0
2025-05-07T20:22:59.7501498Z serio 28672 3 i8042
2025-05-07T20:22:59.7502022Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:59.7502302Z button 24576 0
2025-05-07T20:22:59.7502560Z sch_fq_codel 20480 17
2025-05-07T20:22:59.7502816Z dm_mod 188416 0
2025-05-07T20:22:59.7503067Z fuse 163840 1
2025-05-07T20:22:59.7503321Z loop 36864 0
2025-05-07T20:22:59.7503728Z configfs 57344 1
2025-05-07T20:22:59.7503992Z dax 45056 1 dm_mod
2025-05-07T20:22:59.7504273Z dmi_sysfs 20480 0
2025-05-07T20:22:59.7504530Z crc32_pclmul 16384 0
2025-05-07T20:22:59.7504802Z crc32c_intel 24576 0
2025-05-07T20:22:59.7505062Z efivarfs 24576 1
2025-05-07T20:22:59.7505316Z + modinfo nvidia
2025-05-07T20:22:59.7507960Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.7508428Z import_ns: DMA_BUF
2025-05-07T20:22:59.7508687Z alias: char-major-195-*
2025-05-07T20:22:59.7508957Z version: 570.133.07
2025-05-07T20:22:59.7509213Z supported: external
2025-05-07T20:22:59.7509474Z license: Dual MIT/GPL
2025-05-07T20:22:59.7509766Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.7510119Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.7510448Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:59.7510779Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.7511119Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.7511463Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.7511787Z depends: i2c-core,drm
2025-05-07T20:22:59.7512049Z retpoline: Y
2025-05-07T20:22:59.7512279Z name: nvidia
2025-05-07T20:22:59.7512646Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.7513122Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.7513575Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.7513998Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.7514314Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:59.7514616Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.7514938Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:59.7515243Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:59.7515552Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:59.7515922Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.7516319Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.7516657Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.7516963Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:59.7517274Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.7517638Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.7518042Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.7518428Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.7518851Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.7519261Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.7519692Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.7520119Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.7520459Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.7520834Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.7521318Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.7521667Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.7521996Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.7522338Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.7522671Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.7522990Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:59.7523348Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.7523720Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.7524047Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:59.7524398Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.7524754Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.7525191Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:59.7525541Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.7525882Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:59.7526187Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.7526516Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.7526849Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.7527174Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.7527506Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.7527871Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.7528229Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:59.7528554Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.7528912Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.7529267Z parm: rm_firmware_active:charp
2025-05-07T20:22:59.7529552Z + set +e
2025-05-07T20:22:59.7529764Z + nvidia-smi
2025-05-07T20:23:01.1545660Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.1546259Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1546801Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:01.1547284Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.1547780Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.1548317Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:01.1548763Z | | | MIG M. |
2025-05-07T20:23:01.1549102Z |=========================================+========================+======================|
2025-05-07T20:23:01.1610227Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:01.1610697Z | 0% 32C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:01.1611094Z | | | N/A |
2025-05-07T20:23:01.1611487Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.1612109Z
2025-05-07T20:23:01.1612558Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1613010Z | Processes: |
2025-05-07T20:23:01.1613468Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:01.1613891Z | ID ID Usage |
2025-05-07T20:23:01.1614250Z |=========================================================================================|
2025-05-07T20:23:01.1615146Z | No running processes found |
2025-05-07T20:23:01.1615877Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.5844062Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.9922067Z NVIDIA A10G
2025-05-07T20:23:03.2587470Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.2587820Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.2588067Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.2588361Z + set -e
2025-05-07T20:23:03.2588578Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.2596749Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.2599638Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.6775342Z Last metadata expiration check: 0:04:50 ago on Wed May 7 20:18:13 2025.
2025-05-07T20:23:03.7032420Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.7431976Z Dependencies resolved.
2025-05-07T20:23:03.7613704Z Nothing to do.
2025-05-07T20:23:03.7614257Z Complete!
2025-05-07T20:23:03.8012604Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.8013247Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8014095Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1249476Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1826875Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.7123061Z nvidia-container-toolkit 15 kB/s | 833 B 00:00
2025-05-07T20:23:04.7375003Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.7772911Z Dependencies resolved.
2025-05-07T20:23:04.7954232Z ================================================================================
2025-05-07T20:23:04.7955284Z Package Arch Version Repository Size
2025-05-07T20:23:04.7956034Z ================================================================================
2025-05-07T20:23:04.7956439Z Downgrading:
2025-05-07T20:23:04.7956823Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.7957433Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.7957851Z
2025-05-07T20:23:04.7957952Z Transaction Summary
2025-05-07T20:23:04.7958214Z ================================================================================
2025-05-07T20:23:04.7958536Z Downgrade 2 Packages
2025-05-07T20:23:04.7958687Z
2025-05-07T20:23:04.7958802Z Total download size: 6.8 M
2025-05-07T20:23:04.7959058Z Downloading Packages:
2025-05-07T20:23:04.8447529Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64 26 MB/s | 1.2 MB 00:00
2025-05-07T20:23:04.8881809Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x 62 MB/s | 5.6 MB 00:00
2025-05-07T20:23:04.8890460Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.8893565Z Total 73 MB/s | 6.8 MB 00:00
2025-05-07T20:23:04.8895974Z Running transaction check
2025-05-07T20:23:04.9001005Z Transaction check succeeded.
2025-05-07T20:23:04.9001360Z Running transaction test
2025-05-07T20:23:04.9296166Z Transaction test succeeded.
2025-05-07T20:23:04.9298495Z Running transaction
2025-05-07T20:23:05.4805799Z Preparing : 1/1
2025-05-07T20:23:05.5855574Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:05.5878432Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6098261Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6098860Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6201839Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6223011Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:06.9932958Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:06.9933583Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:06.9934146Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:06.9934680Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:07.1285407Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
================================================================================
2025-05-07T20:23:07.1286344Z WARNING:
2025-05-07T20:23:07.1286600Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.1286832Z
2025-05-07T20:23:07.1286929Z Available Versions:
2025-05-07T20:23:07.1287082Z
2025-05-07T20:23:07.1287187Z Version 2023.7.20250331:
2025-05-07T20:23:07.1287505Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.1287760Z
2025-05-07T20:23:07.1287896Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.1288110Z
2025-05-07T20:23:07.1288199Z Release notes:
2025-05-07T20:23:07.1296835Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.1297272Z
2025-05-07T20:23:07.1297374Z Version 2023.7.20250414:
2025-05-07T20:23:07.1297699Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.1297956Z
2025-05-07T20:23:07.1298087Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.1298300Z
2025-05-07T20:23:07.1298388Z Release notes:
2025-05-07T20:23:07.1298806Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.1299184Z
2025-05-07T20:23:07.1299274Z Version 2023.7.20250428:
2025-05-07T20:23:07.1299595Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.1299849Z
2025-05-07T20:23:07.1299968Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.1300186Z
2025-05-07T20:23:07.1300276Z Release notes:
2025-05-07T20:23:07.1300725Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.1301226Z
2025-05-07T20:23:07.1301376Z ================================================================================
2025-05-07T20:23:07.1642947Z
2025-05-07T20:23:07.1643128Z
2025-05-07T20:23:07.1643420Z Downgraded:
2025-05-07T20:23:07.1643818Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.1644401Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.1644782Z
2025-05-07T20:23:07.1644870Z Complete!
2025-05-07T20:23:07.2085183Z + sudo systemctl restart docker
2025-05-07T20:23:11.2014843Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.2015304Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2015830Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:11.2016323Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2016824Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.2017367Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:11.2017811Z | | | MIG M. |
2025-05-07T20:23:11.2018162Z |=========================================+========================+======================|
2025-05-07T20:23:11.2098879Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:11.2101591Z | 0% 32C P0 63W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:11.2102098Z | | | N/A |
2025-05-07T20:23:11.2102513Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2102974Z
2025-05-07T20:23:11.2103592Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2104217Z | Processes: |
2025-05-07T20:23:11.2104752Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:11.2105418Z | ID ID Usage |
2025-05-07T20:23:11.2105773Z |=========================================================================================|
2025-05-07T20:23:11.2106203Z | No running processes found |
2025-05-07T20:23:11.2106677Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.0577219Z Command completed after 1 attempt(s).
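With the toolkit downgraded to 1.16.2 and docker restarted, the step exports GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all for the container runs that follow (visible in the env dump below). A hedged smoke test of that configuration (the CUDA image tag is an assumption, not something this job pulls):

# Confirm containers can see the A10G through the NVIDIA container runtime.
docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
    nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi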
2025-05-07T20:23:12.4074930Z + printenv 2025-05-07T20:23:12.4075049Z 2025-05-07T20:23:12.4097282Z SHELL=/bin/bash 2025-05-07T20:23:12.4097626Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:12.4098050Z BUILD_VARIANT=cuda 2025-05-07T20:23:12.4098616Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4099199Z GITHUB_ACTION=__run 2025-05-07T20:23:12.4099486Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.4099831Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:12.4100092Z RUNNER_NAME=i-011bf0f995071f8f9 2025-05-07T20:23:12.4100371Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:12.4100687Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:12.4100963Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:12.4101430Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:12.4101865Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:12.4102157Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:12.4102460Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:12.4103090Z *** 2025-05-07T20:23:12.4103298Z LOGNAME=ec2-user 2025-05-07T20:23:12.4103544Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:12.4103814Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:12.4104067Z GITHUB_ACTIONS=true 2025-05-07T20:23:12.4104302Z SYSTEMD_EXEC_PID=55588 2025-05-07T20:23:12.4104585Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:12.4105141Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:12.4105656Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:12.4105939Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:12.4106210Z RUNNER_OS=Linux 2025-05-07T20:23:12.4106440Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:12.4106699Z HOME=/home/ec2-user 2025-05-07T20:23:12.4106956Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:12.4107252Z LANG=C.UTF-8 2025-05-07T20:23:12.4107556Z RUNNER_TRACKING_ID=github_ae3f7369-4363-4024-b8cc-9d7f5b212b73 2025-05-07T20:23:12.4107920Z RUNNER_ARCH=X64 2025-05-07T20:23:12.4108203Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:12.4108890Z BUILD_TARGET=genai 2025-05-07T20:23:12.4109415Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4110278Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4111007Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:12.4113359Z INVOCATION_ID=95b235614dca4c4e829ee33e73dd6c05 2025-05-07T20:23:12.4113700Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:12.4113969Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:12.4114556Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4115169Z BUILD_ENV=build_binary 2025-05-07T20:23:12.4115397Z GITHUB_ACTOR=q10 2025-05-07T20:23:12.4115623Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:12.4115855Z KERN_NAME_LC=linux 2025-05-07T20:23:12.4116084Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:12.4116392Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:12.4116737Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:12.4116981Z USER=ec2-user 2025-05-07T20:23:12.4117219Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:12.4117500Z SHLVL=1 2025-05-07T20:23:12.4117698Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.4118019Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.4118467Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.4118826Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.4119071Z KERN_NAME=Linux 2025-05-07T20:23:12.4119302Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.4119710Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.4120134Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.4120416Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.4120660Z JOURNAL_STREAM=8:94509 2025-05-07T20:23:12.4120976Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.4121343Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.4121656Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.4121988Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.4122212Z CI=true 2025-05-07T20:23:12.4122429Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.4122711Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.4122991Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.4123244Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.4123854Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4124440Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.4124663Z _=/usr/bin/printenv 2025-05-07T20:23:12.4124810Z 2025-05-07T20:23:12.4124933Z ################################################################################ 2025-05-07T20:23:12.4125249Z [INFO] Print ldd version ... 2025-05-07T20:23:12.4125515Z + ldd --version 2025-05-07T20:23:12.4125644Z 2025-05-07T20:23:12.4125738Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.4126003Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.4126447Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.4126978Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.4127422Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.4127643Z 2025-05-07T20:23:12.4127762Z ################################################################################ 2025-05-07T20:23:12.4128076Z [INFO] Print CPU info ... 
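The CPU census that follows (nproc, lscpu, /proc/cpuinfo) is the kind of data later build steps consult when picking a parallelism level; on this g5.4xlarge the arithmetic is 1 socket x 8 cores x 2 threads = 16 logical CPUs. A sketch of deriving both counts, assuming only standard coreutils/util-linux; the MAX_JOBS knob is hypothetical:

# Logical CPUs (SMT included) vs. physical cores; builds are often capped at
# physical cores to avoid oversubscribing the 2-way SMT reported below.
logical=$(nproc)
physical=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
export MAX_JOBS="${physical}"   # hypothetical knob a build script might honor
echo "logical=${logical} physical=${physical}"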
2025-05-07T20:23:12.4128320Z + nproc 2025-05-07T20:23:12.4128430Z 2025-05-07T20:23:12.4141832Z 16 2025-05-07T20:23:12.4143362Z 2025-05-07T20:23:12.4143591Z + lscpu 2025-05-07T20:23:12.4143715Z 2025-05-07T20:23:12.4251405Z Architecture: x86_64 2025-05-07T20:23:12.4251791Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.4252457Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4252860Z Byte Order: Little Endian 2025-05-07T20:23:12.4253182Z CPU(s): 16 2025-05-07T20:23:12.4253483Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.4253800Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.4254147Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.4254470Z CPU family: 23 2025-05-07T20:23:12.4254897Z Model: 49 2025-05-07T20:23:12.4255192Z Thread(s) per core: 2 2025-05-07T20:23:12.4255486Z Core(s) per socket: 8 2025-05-07T20:23:12.4255764Z Socket(s): 1 2025-05-07T20:23:12.4256049Z Stepping: 0 2025-05-07T20:23:12.4256358Z BogoMIPS: 5599.99 2025-05-07T20:23:12.4258400Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4260433Z Hypervisor vendor: KVM 2025-05-07T20:23:12.4260745Z Virtualization type: full 2025-05-07T20:23:12.4261080Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.4261525Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.4261889Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.4262289Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.4262617Z NUMA node(s): 1 2025-05-07T20:23:12.4262912Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.4263250Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.4263623Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.4263985Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.4264339Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.4264695Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.4265066Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.4265441Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.4266084Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.4266798Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.4267576Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.4268542Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.4269571Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.4270247Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.4270619Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.4270851Z 2025-05-07T20:23:12.4270953Z + cat /proc/cpuinfo 2025-05-07T20:23:12.4271092Z 2025-05-07T20:23:12.4271180Z processor : 0 2025-05-07T20:23:12.4271407Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4271661Z cpu family : 23 2025-05-07T20:23:12.4271868Z model : 49 
2025-05-07T20:23:12.4272081Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4272334Z stepping : 0 2025-05-07T20:23:12.4272548Z microcode : 0x830107f 2025-05-07T20:23:12.4272889Z cpu MHz : 3305.709 2025-05-07T20:23:12.4273114Z cache size : 512 KB 2025-05-07T20:23:12.4273331Z physical id : 0 2025-05-07T20:23:12.4273548Z siblings : 16 2025-05-07T20:23:12.4273753Z core id : 0 2025-05-07T20:23:12.4273953Z cpu cores : 8 2025-05-07T20:23:12.4274161Z apicid : 0 2025-05-07T20:23:12.4274367Z initial apicid : 0 2025-05-07T20:23:12.4274578Z fpu : yes 2025-05-07T20:23:12.4274784Z fpu_exception : yes 2025-05-07T20:23:12.4275009Z cpuid level : 13 2025-05-07T20:23:12.4275214Z wp : yes 2025-05-07T20:23:12.4277273Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4279476Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4279960Z bogomips : 5599.99 2025-05-07T20:23:12.4280183Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4280422Z clflush size : 64 2025-05-07T20:23:12.4280641Z cache_alignment : 64 2025-05-07T20:23:12.4280916Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4281236Z power management: 2025-05-07T20:23:12.4281376Z 2025-05-07T20:23:12.4281464Z processor : 1 2025-05-07T20:23:12.4281685Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4281922Z cpu family : 23 2025-05-07T20:23:12.4282172Z model : 49 2025-05-07T20:23:12.4282401Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4282653Z stepping : 0 2025-05-07T20:23:12.4282860Z microcode : 0x830107f 2025-05-07T20:23:12.4283091Z cpu MHz : 3297.096 2025-05-07T20:23:12.4283311Z cache size : 512 KB 2025-05-07T20:23:12.4283530Z physical id : 0 2025-05-07T20:23:12.4283742Z siblings : 16 2025-05-07T20:23:12.4283946Z core id : 1 2025-05-07T20:23:12.4284142Z cpu cores : 8 2025-05-07T20:23:12.4284345Z apicid : 2 2025-05-07T20:23:12.4284547Z initial apicid : 2 2025-05-07T20:23:12.4284755Z fpu : yes 2025-05-07T20:23:12.4284960Z fpu_exception : yes 2025-05-07T20:23:12.4285183Z cpuid level : 13 2025-05-07T20:23:12.4285388Z wp : yes 2025-05-07T20:23:12.4287314Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4289505Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4289996Z bogomips : 5599.99 2025-05-07T20:23:12.4290215Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4290458Z clflush size : 64 
2025-05-07T20:23:12.4290684Z cache_alignment : 64 2025-05-07T20:23:12.4290950Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4291268Z power management: 2025-05-07T20:23:12.4291409Z 2025-05-07T20:23:12.4291500Z processor : 2 2025-05-07T20:23:12.4291725Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4291975Z cpu family : 23 2025-05-07T20:23:12.4292223Z model : 49 2025-05-07T20:23:12.4292439Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4292678Z stepping : 0 2025-05-07T20:23:12.4292894Z microcode : 0x830107f 2025-05-07T20:23:12.4293126Z cpu MHz : 3301.203 2025-05-07T20:23:12.4293341Z cache size : 512 KB 2025-05-07T20:23:12.4293564Z physical id : 0 2025-05-07T20:23:12.4293782Z siblings : 16 2025-05-07T20:23:12.4294069Z core id : 2 2025-05-07T20:23:12.4294278Z cpu cores : 8 2025-05-07T20:23:12.4294483Z apicid : 4 2025-05-07T20:23:12.4294681Z initial apicid : 4 2025-05-07T20:23:12.4294897Z fpu : yes 2025-05-07T20:23:12.4295101Z fpu_exception : yes 2025-05-07T20:23:12.4295315Z cpuid level : 13 2025-05-07T20:23:12.4295529Z wp : yes 2025-05-07T20:23:12.4297563Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4299745Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4300236Z bogomips : 5599.99 2025-05-07T20:23:12.4300453Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4300695Z clflush size : 64 2025-05-07T20:23:12.4300919Z cache_alignment : 64 2025-05-07T20:23:12.4301338Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4301656Z power management: 2025-05-07T20:23:12.4301798Z 2025-05-07T20:23:12.4301888Z processor : 3 2025-05-07T20:23:12.4302110Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4302353Z cpu family : 23 2025-05-07T20:23:12.4302565Z model : 49 2025-05-07T20:23:12.4302777Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4303016Z stepping : 0 2025-05-07T20:23:12.4303228Z microcode : 0x830107f 2025-05-07T20:23:12.4303459Z cpu MHz : 3298.878 2025-05-07T20:23:12.4303672Z cache size : 512 KB 2025-05-07T20:23:12.4303892Z physical id : 0 2025-05-07T20:23:12.4304108Z siblings : 16 2025-05-07T20:23:12.4304307Z core id : 3 2025-05-07T20:23:12.4304515Z cpu cores : 8 2025-05-07T20:23:12.4304718Z apicid : 6 2025-05-07T20:23:12.4304915Z initial apicid : 6 2025-05-07T20:23:12.4305135Z fpu : yes 2025-05-07T20:23:12.4305350Z fpu_exception : yes 2025-05-07T20:23:12.4305574Z cpuid level : 13 2025-05-07T20:23:12.4305780Z wp : yes 2025-05-07T20:23:12.4307704Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4309933Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4310422Z bogomips : 5599.99 2025-05-07T20:23:12.4310650Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4310886Z clflush size : 64 2025-05-07T20:23:12.4311107Z cache_alignment : 64 2025-05-07T20:23:12.4311386Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4311702Z power management: 2025-05-07T20:23:12.4311837Z 2025-05-07T20:23:12.4321683Z processor : 4 2025-05-07T20:23:12.4321933Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4322224Z cpu family : 23 2025-05-07T20:23:12.4322457Z model : 49 2025-05-07T20:23:12.4322674Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4322931Z stepping : 0 2025-05-07T20:23:12.4323151Z microcode : 0x830107f 2025-05-07T20:23:12.4323380Z cpu MHz : 3306.286 2025-05-07T20:23:12.4323604Z cache size : 512 KB 2025-05-07T20:23:12.4323827Z physical id : 0 2025-05-07T20:23:12.4324036Z siblings : 16 2025-05-07T20:23:12.4324242Z core id : 4 2025-05-07T20:23:12.4324449Z cpu cores : 8 2025-05-07T20:23:12.4324658Z apicid : 8 2025-05-07T20:23:12.4325007Z initial apicid : 8 2025-05-07T20:23:12.4325231Z fpu : yes 2025-05-07T20:23:12.4325440Z fpu_exception : yes 2025-05-07T20:23:12.4325660Z cpuid level : 13 2025-05-07T20:23:12.4325875Z wp : yes 2025-05-07T20:23:12.4327919Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4330155Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4330636Z bogomips : 5599.99 2025-05-07T20:23:12.4330874Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4331121Z clflush size : 64 2025-05-07T20:23:12.4331335Z cache_alignment : 64 2025-05-07T20:23:12.4331615Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4331939Z power management: 2025-05-07T20:23:12.4332076Z 2025-05-07T20:23:12.4332170Z processor : 5 2025-05-07T20:23:12.4332384Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4332628Z cpu family : 23 2025-05-07T20:23:12.4332842Z model : 49 2025-05-07T20:23:12.4333049Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4333302Z stepping : 0 2025-05-07T20:23:12.4333517Z microcode : 0x830107f 2025-05-07T20:23:12.4333744Z cpu MHz : 3299.808 2025-05-07T20:23:12.4333970Z cache size : 512 KB 2025-05-07T20:23:12.4334191Z physical id : 0 2025-05-07T20:23:12.4334398Z siblings : 16 2025-05-07T20:23:12.4334603Z core id : 5 2025-05-07T20:23:12.4334808Z cpu cores : 8 2025-05-07T20:23:12.4335007Z apicid : 10 2025-05-07T20:23:12.4335216Z initial apicid : 10 2025-05-07T20:23:12.4335432Z fpu : yes 2025-05-07T20:23:12.4335637Z fpu_exception : yes 2025-05-07T20:23:12.4335861Z cpuid level : 13 2025-05-07T20:23:12.4336079Z wp : yes 2025-05-07T20:23:12.4337987Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4341046Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4341609Z bogomips : 5599.99 2025-05-07T20:23:12.4341836Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4342084Z clflush size : 64 2025-05-07T20:23:12.4342305Z cache_alignment : 64 2025-05-07T20:23:12.4342583Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4342900Z power management: 2025-05-07T20:23:12.4343036Z 2025-05-07T20:23:12.4343121Z processor : 6 2025-05-07T20:23:12.4343340Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4343586Z cpu family : 23 2025-05-07T20:23:12.4343790Z model : 49 2025-05-07T20:23:12.4343998Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4344243Z stepping : 0 2025-05-07T20:23:12.4344450Z microcode : 0x830107f 2025-05-07T20:23:12.4344685Z cpu MHz : 2788.357 2025-05-07T20:23:12.4344907Z cache size : 512 KB 2025-05-07T20:23:12.4345120Z physical id : 0 2025-05-07T20:23:12.4345333Z siblings : 16 2025-05-07T20:23:12.4345538Z core id : 6 2025-05-07T20:23:12.4345737Z cpu cores : 8 2025-05-07T20:23:12.4345946Z apicid : 12 2025-05-07T20:23:12.4346154Z initial apicid : 12 2025-05-07T20:23:12.4346369Z fpu : yes 2025-05-07T20:23:12.4346572Z fpu_exception : yes 2025-05-07T20:23:12.4346791Z cpuid level : 13 2025-05-07T20:23:12.4347159Z wp : yes 2025-05-07T20:23:12.4349211Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4351435Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4351920Z bogomips : 5599.99 2025-05-07T20:23:12.4352150Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4352424Z clflush size : 64 2025-05-07T20:23:12.4352655Z cache_alignment : 64 2025-05-07T20:23:12.4352937Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4353248Z power management: 2025-05-07T20:23:12.4353387Z 2025-05-07T20:23:12.4353471Z processor : 7 2025-05-07T20:23:12.4353689Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4353926Z cpu family : 23 2025-05-07T20:23:12.4354135Z model : 49 2025-05-07T20:23:12.4354348Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4354587Z stepping : 0 2025-05-07T20:23:12.4354798Z microcode : 0x830107f 2025-05-07T20:23:12.4355031Z cpu MHz : 3300.822 2025-05-07T20:23:12.4355248Z cache size : 512 KB 2025-05-07T20:23:12.4355468Z physical id : 0 2025-05-07T20:23:12.4355683Z siblings : 16 2025-05-07T20:23:12.4355881Z core id : 7 2025-05-07T20:23:12.4356078Z cpu cores : 8 2025-05-07T20:23:12.4356280Z apicid : 
14 2025-05-07T20:23:12.4356488Z initial apicid : 14 2025-05-07T20:23:12.4356744Z fpu : yes 2025-05-07T20:23:12.4356942Z fpu_exception : yes 2025-05-07T20:23:12.4357163Z cpuid level : 13 2025-05-07T20:23:12.4357378Z wp : yes 2025-05-07T20:23:12.4359298Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4361488Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4361974Z bogomips : 5599.99 2025-05-07T20:23:12.4362200Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4362445Z clflush size : 64 2025-05-07T20:23:12.4362660Z cache_alignment : 64 2025-05-07T20:23:12.4362937Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4363265Z power management: 2025-05-07T20:23:12.4363400Z 2025-05-07T20:23:12.4363485Z processor : 8 2025-05-07T20:23:12.4363705Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4363943Z cpu family : 23 2025-05-07T20:23:12.4364147Z model : 49 2025-05-07T20:23:12.4364361Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4364608Z stepping : 0 2025-05-07T20:23:12.4364813Z microcode : 0x830107f 2025-05-07T20:23:12.4365045Z cpu MHz : 3302.151 2025-05-07T20:23:12.4365263Z cache size : 512 KB 2025-05-07T20:23:12.4365472Z physical id : 0 2025-05-07T20:23:12.4365682Z siblings : 16 2025-05-07T20:23:12.4365890Z core id : 0 2025-05-07T20:23:12.4366088Z cpu cores : 8 2025-05-07T20:23:12.4366286Z apicid : 1 2025-05-07T20:23:12.4366484Z initial apicid : 1 2025-05-07T20:23:12.4366698Z fpu : yes 2025-05-07T20:23:12.4366891Z fpu_exception : yes 2025-05-07T20:23:12.4367109Z cpuid level : 13 2025-05-07T20:23:12.4367318Z wp : yes 2025-05-07T20:23:12.4369218Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4371768Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4372250Z bogomips : 5599.99 2025-05-07T20:23:12.4372471Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4372707Z clflush size : 64 2025-05-07T20:23:12.4372920Z cache_alignment : 64 2025-05-07T20:23:12.4373192Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4373502Z power management: 2025-05-07T20:23:12.4373635Z 2025-05-07T20:23:12.4373724Z processor : 9 2025-05-07T20:23:12.4373939Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4374179Z cpu family : 23 2025-05-07T20:23:12.4374381Z model : 49 2025-05-07T20:23:12.4374588Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4374829Z 
stepping : 0 2025-05-07T20:23:12.4375033Z microcode : 0x830107f 2025-05-07T20:23:12.4375261Z cpu MHz : 3295.300 2025-05-07T20:23:12.4375476Z cache size : 512 KB 2025-05-07T20:23:12.4375746Z physical id : 0 2025-05-07T20:23:12.4375954Z siblings : 16 2025-05-07T20:23:12.4376156Z core id : 1 2025-05-07T20:23:12.4376364Z cpu cores : 8 2025-05-07T20:23:12.4376564Z apicid : 3 2025-05-07T20:23:12.4376764Z initial apicid : 3 2025-05-07T20:23:12.4376976Z fpu : yes 2025-05-07T20:23:12.4377169Z fpu_exception : yes 2025-05-07T20:23:12.4377387Z cpuid level : 13 2025-05-07T20:23:12.4377596Z wp : yes 2025-05-07T20:23:12.4379494Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4381737Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4382225Z bogomips : 5599.99 2025-05-07T20:23:12.4382448Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4382679Z clflush size : 64 2025-05-07T20:23:12.4382896Z cache_alignment : 64 2025-05-07T20:23:12.4383166Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4383474Z power management: 2025-05-07T20:23:12.4383609Z 2025-05-07T20:23:12.4383694Z processor : 10 2025-05-07T20:23:12.4383914Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4384155Z cpu family : 23 2025-05-07T20:23:12.4384358Z model : 49 2025-05-07T20:23:12.4384563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4384805Z stepping : 0 2025-05-07T20:23:12.4385009Z microcode : 0x830107f 2025-05-07T20:23:12.4385238Z cpu MHz : 3298.605 2025-05-07T20:23:12.4385455Z cache size : 512 KB 2025-05-07T20:23:12.4385665Z physical id : 0 2025-05-07T20:23:12.4385880Z siblings : 16 2025-05-07T20:23:12.4386082Z core id : 2 2025-05-07T20:23:12.4386277Z cpu cores : 8 2025-05-07T20:23:12.4386479Z apicid : 5 2025-05-07T20:23:12.4386682Z initial apicid : 5 2025-05-07T20:23:12.4386889Z fpu : yes 2025-05-07T20:23:12.4387086Z fpu_exception : yes 2025-05-07T20:23:12.4387305Z cpuid level : 13 2025-05-07T20:23:12.4387506Z wp : yes 2025-05-07T20:23:12.4389408Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4391716Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4392245Z bogomips : 5599.99 2025-05-07T20:23:12.4392577Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4392811Z clflush size : 64 2025-05-07T20:23:12.4393031Z cache_alignment : 64 2025-05-07T20:23:12.4393306Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.4393616Z power management: 2025-05-07T20:23:12.4393753Z 2025-05-07T20:23:12.4393841Z processor : 11 2025-05-07T20:23:12.4394062Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4394292Z cpu family : 23 2025-05-07T20:23:12.4394502Z model : 49 2025-05-07T20:23:12.4394710Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4394949Z stepping : 0 2025-05-07T20:23:12.4395163Z microcode : 0x830107f 2025-05-07T20:23:12.4395389Z cpu MHz : 3300.256 2025-05-07T20:23:12.4395595Z cache size : 512 KB 2025-05-07T20:23:12.4395812Z physical id : 0 2025-05-07T20:23:12.4396022Z siblings : 16 2025-05-07T20:23:12.4396216Z core id : 3 2025-05-07T20:23:12.4396421Z cpu cores : 8 2025-05-07T20:23:12.4396621Z apicid : 7 2025-05-07T20:23:12.4396820Z initial apicid : 7 2025-05-07T20:23:12.4397039Z fpu : yes 2025-05-07T20:23:12.4397239Z fpu_exception : yes 2025-05-07T20:23:12.4397452Z cpuid level : 13 2025-05-07T20:23:12.4397665Z wp : yes 2025-05-07T20:23:12.4399606Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4401781Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4402261Z bogomips : 5599.99 2025-05-07T20:23:12.4402476Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4402715Z clflush size : 64 2025-05-07T20:23:12.4402932Z cache_alignment : 64 2025-05-07T20:23:12.4403196Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4403513Z power management: 2025-05-07T20:23:12.4403644Z 2025-05-07T20:23:12.4403738Z processor : 12 2025-05-07T20:23:12.4403948Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4404187Z cpu family : 23 2025-05-07T20:23:12.4404395Z model : 49 2025-05-07T20:23:12.4404593Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4404844Z stepping : 0 2025-05-07T20:23:12.4405055Z microcode : 0x830107f 2025-05-07T20:23:12.4405279Z cpu MHz : 3300.699 2025-05-07T20:23:12.4405489Z cache size : 512 KB 2025-05-07T20:23:12.4405700Z physical id : 0 2025-05-07T20:23:12.4405909Z siblings : 16 2025-05-07T20:23:12.4406103Z core id : 4 2025-05-07T20:23:12.4406299Z cpu cores : 8 2025-05-07T20:23:12.4406498Z apicid : 9 2025-05-07T20:23:12.4406689Z initial apicid : 9 2025-05-07T20:23:12.4406902Z fpu : yes 2025-05-07T20:23:12.4407101Z fpu_exception : yes 2025-05-07T20:23:12.4407316Z cpuid level : 13 2025-05-07T20:23:12.4407525Z wp : yes 2025-05-07T20:23:12.4409426Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.4411724Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4412200Z bogomips : 5599.99 2025-05-07T20:23:12.4412417Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4412651Z clflush size : 64 2025-05-07T20:23:12.4412865Z cache_alignment : 64 2025-05-07T20:23:12.4413215Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4413531Z power management: 2025-05-07T20:23:12.4413661Z 2025-05-07T20:23:12.4413749Z processor : 13 2025-05-07T20:23:12.4413960Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4414196Z cpu family : 23 2025-05-07T20:23:12.4414401Z model : 49 2025-05-07T20:23:12.4414600Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4414842Z stepping : 0 2025-05-07T20:23:12.4415050Z microcode : 0x830107f 2025-05-07T20:23:12.4415278Z cpu MHz : 3299.588 2025-05-07T20:23:12.4415488Z cache size : 512 KB 2025-05-07T20:23:12.4415706Z physical id : 0 2025-05-07T20:23:12.4415911Z siblings : 16 2025-05-07T20:23:12.4416112Z core id : 5 2025-05-07T20:23:12.4416309Z cpu cores : 8 2025-05-07T20:23:12.4416508Z apicid : 11 2025-05-07T20:23:12.4416709Z initial apicid : 11 2025-05-07T20:23:12.4416920Z fpu : yes 2025-05-07T20:23:12.4417113Z fpu_exception : yes 2025-05-07T20:23:12.4417331Z cpuid level : 13 2025-05-07T20:23:12.4417539Z wp : yes 2025-05-07T20:23:12.4419486Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4421758Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4422282Z bogomips : 5599.99 2025-05-07T20:23:12.4422501Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4422737Z clflush size : 64 2025-05-07T20:23:12.4422947Z cache_alignment : 64 2025-05-07T20:23:12.4423217Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4423533Z power management: 2025-05-07T20:23:12.4423667Z 2025-05-07T20:23:12.4423749Z processor : 14 2025-05-07T20:23:12.4423963Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4424199Z cpu family : 23 2025-05-07T20:23:12.4424401Z model : 49 2025-05-07T20:23:12.4424608Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4424852Z stepping : 0 2025-05-07T20:23:12.4425055Z microcode : 0x830107f 2025-05-07T20:23:12.4425279Z cpu MHz : 3302.929 2025-05-07T20:23:12.4425499Z cache size : 512 KB 2025-05-07T20:23:12.4425709Z physical id : 0 2025-05-07T20:23:12.4425920Z siblings : 16 2025-05-07T20:23:12.4426119Z core id : 6 2025-05-07T20:23:12.4426311Z cpu cores : 8 2025-05-07T20:23:12.4426510Z apicid : 13 2025-05-07T20:23:12.4426715Z initial apicid : 13 2025-05-07T20:23:12.4426924Z fpu : yes 2025-05-07T20:23:12.4427121Z fpu_exception : yes 2025-05-07T20:23:12.4427340Z cpuid level : 13 2025-05-07T20:23:12.4427541Z wp : yes 2025-05-07T20:23:12.4429487Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4431786Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4432263Z bogomips : 5599.99 2025-05-07T20:23:12.4432482Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4432711Z clflush size : 64 2025-05-07T20:23:12.4432936Z cache_alignment : 64 2025-05-07T20:23:12.4433203Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4433520Z power management: 2025-05-07T20:23:12.4433651Z 2025-05-07T20:23:12.4433834Z processor : 15 2025-05-07T20:23:12.4434053Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4434296Z cpu family : 23 2025-05-07T20:23:12.4434507Z model : 49 2025-05-07T20:23:12.4434717Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4434959Z stepping : 0 2025-05-07T20:23:12.4435171Z microcode : 0x830107f 2025-05-07T20:23:12.4435401Z cpu MHz : 3299.879 2025-05-07T20:23:12.4435608Z cache size : 512 KB 2025-05-07T20:23:12.4435827Z physical id : 0 2025-05-07T20:23:12.4436045Z siblings : 16 2025-05-07T20:23:12.4436244Z core id : 7 2025-05-07T20:23:12.4436445Z cpu cores : 8 2025-05-07T20:23:12.4436648Z apicid : 15 2025-05-07T20:23:12.4436849Z initial apicid : 15 2025-05-07T20:23:12.4437067Z fpu : yes 2025-05-07T20:23:12.4437270Z fpu_exception : yes 2025-05-07T20:23:12.4437486Z cpuid level : 13 2025-05-07T20:23:12.4437695Z wp : yes 2025-05-07T20:23:12.4439610Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4443013Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4443500Z bogomips : 5599.99 2025-05-07T20:23:12.4443723Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4443961Z clflush size : 64 2025-05-07T20:23:12.4444171Z cache_alignment : 64 2025-05-07T20:23:12.4444442Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4444755Z power management: 2025-05-07T20:23:12.4444885Z 2025-05-07T20:23:12.4444889Z 2025-05-07T20:23:12.4445021Z ################################################################################ 2025-05-07T20:23:12.4445329Z [INFO] Print PCI info ... 2025-05-07T20:23:12.4445578Z + lspci -v 2025-05-07T20:23:12.4445694Z 2025-05-07T20:23:12.4445904Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.4446286Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.4446599Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.4446814Z 2025-05-07T20:23:12.4447015Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.4447399Z Physical Slot: 1 2025-05-07T20:23:12.4447645Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4447850Z 2025-05-07T20:23:12.4448093Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.4448528Z Physical Slot: 1 2025-05-07T20:23:12.4448787Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.4449010Z 2025-05-07T20:23:12.4449275Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.4449726Z Physical Slot: 3 2025-05-07T20:23:12.4449966Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4450310Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.4450661Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.4450889Z 2025-05-07T20:23:12.4451187Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.4451855Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.4452187Z Physical Slot: 4 2025-05-07T20:23:12.4452450Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.4452833Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4453180Z Capabilities: 2025-05-07T20:23:12.4453444Z Kernel driver in use: nvme 2025-05-07T20:23:12.4453610Z 2025-05-07T20:23:12.4453937Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.4454418Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.4454754Z Physical Slot: 5 2025-05-07T20:23:12.4455001Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4455363Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4455738Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.4456061Z Capabilities: 2025-05-07T20:23:12.4456332Z Kernel driver in use: ena 2025-05-07T20:23:12.4456571Z Kernel modules: ena 2025-05-07T20:23:12.4456717Z 2025-05-07T20:23:12.4456886Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.4457265Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.4457560Z Physical Slot: 30 2025-05-07T20:23:12.4457814Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.4458194Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.4458587Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.4458965Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.4459298Z Capabilities: 2025-05-07T20:23:12.4459567Z Kernel driver in use: nvidia 2025-05-07T20:23:12.4459817Z Kernel modules: nvidia 2025-05-07T20:23:12.4459971Z 2025-05-07T20:23:12.4460269Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.4460780Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.4461066Z Physical Slot: 31 2025-05-07T20:23:12.4461371Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4461728Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4462104Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.4462437Z Capabilities: 2025-05-07T20:23:12.4462708Z Kernel driver in use: nvme 2025-05-07T20:23:12.4462869Z 2025-05-07T20:23:12.4462873Z 2025-05-07T20:23:12.4463004Z ################################################################################ 2025-05-07T20:23:12.4463328Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.4471598Z + uname -a 2025-05-07T20:23:12.4471733Z 2025-05-07T20:23:12.4472131Z Linux ip-10-0-45-1.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.4472626Z 2025-05-07T20:23:12.4472709Z + uname -m 2025-05-07T20:23:12.4472831Z 2025-05-07T20:23:12.4472912Z x86_64 2025-05-07T20:23:12.4473020Z 2025-05-07T20:23:12.4473106Z + cat /proc/version 2025-05-07T20:23:12.4473246Z 2025-05-07T20:23:12.4473776Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.4474405Z 2025-05-07T20:23:12.4474494Z + cat /etc/os-release 2025-05-07T20:23:12.4474637Z 2025-05-07T20:23:12.4474753Z NAME="Amazon Linux" 2025-05-07T20:23:12.4474967Z VERSION="2023" 2025-05-07T20:23:12.4475173Z ID="amzn" 2025-05-07T20:23:12.4475366Z ID_LIKE="fedora" 2025-05-07T20:23:12.4475566Z VERSION_ID="2023" 2025-05-07T20:23:12.4475803Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.4476090Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.4476373Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.4476626Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.4477140Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.4477581Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.4477996Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.4478439Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.4478808Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.4479044Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.4479335Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.4479487Z 2025-05-07T20:23:12.4479726Z ################################################################################ 2025-05-07T20:23:12.4480036Z # Print EC2 Instance Info 2025-05-07T20:23:12.4480274Z # 2025-05-07T20:23:12.4480478Z # [2025-05-07T20:23:12.446Z] + print_ec2_info 2025-05-07T20:23:12.4480785Z ################################################################################ 2025-05-07T20:23:12.4481000Z 2025-05-07T20:23:12.4591636Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.4724525Z instance-id: i-011bf0f995071f8f9 2025-05-07T20:23:12.4841318Z instance-type: g5.4xlarge 2025-05-07T20:23:12.4880115Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.4880480Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.4889528Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.4889887Z env: 2025-05-07T20:23:12.4890112Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.4890425Z BUILD_ENV: build_binary 2025-05-07T20:23:12.4890682Z BUILD_TARGET: genai 2025-05-07T20:23:12.4890913Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.4891160Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.4891429Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.4891729Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.4892094Z ##[endgroup] 2025-05-07T20:23:12.8220722Z ################################################################################ 2025-05-07T20:23:12.8221271Z [INFO] Printing general display info ... 2025-05-07T20:23:12.8250575Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.9335574Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.9345705Z /usr/bin/sudo 2025-05-07T20:23:12.9356125Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.9366221Z /usr/bin/yum 2025-05-07T20:23:12.9367827Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.9388651Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.3827348Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.4606057Z ================================================================================ 2025-05-07T20:23:13.4606697Z WARNING: 2025-05-07T20:23:13.4607157Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.4607593Z 2025-05-07T20:23:13.4607766Z Available Versions: 2025-05-07T20:23:13.4608040Z 2025-05-07T20:23:13.4608220Z Version 2023.7.20250331: 2025-05-07T20:23:13.4608789Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.4609300Z 2025-05-07T20:23:13.4609548Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.4609944Z 2025-05-07T20:23:13.4610115Z Release notes: 2025-05-07T20:23:13.4610851Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.4611535Z 2025-05-07T20:23:13.4611704Z Version 2023.7.20250414: 2025-05-07T20:23:13.4612269Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.4612725Z 2025-05-07T20:23:13.4612950Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.4613264Z 2025-05-07T20:23:13.4613352Z Release notes: 2025-05-07T20:23:13.4613755Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.4614118Z 2025-05-07T20:23:13.4614218Z Version 2023.7.20250428: 2025-05-07T20:23:13.4614533Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.4614781Z 2025-05-07T20:23:13.4615156Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.4615379Z 2025-05-07T20:23:13.4615469Z Release notes: 2025-05-07T20:23:13.4615869Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.4616229Z 2025-05-07T20:23:13.4616342Z ================================================================================ 2025-05-07T20:23:13.5769021Z Dependencies resolved. 
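Both the pypi.org reachability probe and the yum update above run through the same retry wrapper, which is where the [EXEC] [ATTEMPT 0/3] prefix comes from. A hedged reconstruction of that pattern, before the resolved transaction prints below; the helper name, retry count, and backoff are assumptions, since the real helper lives in .github/scripts/setup_env.bash:

# exec_with_retries: hypothetical stand-in for the wrapper behind the
# "[EXEC] [ATTEMPT i/3] + <command>" lines in this log (0-indexed attempts).
exec_with_retries () {
  local max=3 i
  for ((i = 0; i <= max; i++)); do
    echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
    "$@" && return 0
    sleep 2  # back off briefly before the next attempt
  done
  return 1
}

exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null \
  && echo "[CHECK] Network does not appear to be blocked."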
2025-05-07T20:23:13.6053771Z ================================================================================ 2025-05-07T20:23:13.6054234Z Package Arch Version Repository Size 2025-05-07T20:23:13.6054739Z ================================================================================ 2025-05-07T20:23:13.6055063Z Upgrading: 2025-05-07T20:23:13.6055427Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.6056022Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.6056383Z 2025-05-07T20:23:13.6056705Z Transaction Summary 2025-05-07T20:23:13.6056968Z ================================================================================ 2025-05-07T20:23:13.6057285Z Upgrade 2 Packages 2025-05-07T20:23:13.6057442Z 2025-05-07T20:23:13.6057595Z Total download size: 6.9 M 2025-05-07T20:23:13.6058593Z Downloading Packages: 2025-05-07T20:23:13.6510384Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 28 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.7053967Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 58 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.7062625Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.7065553Z Total 69 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.7068159Z Running transaction check 2025-05-07T20:23:13.7166979Z Transaction check succeeded. 2025-05-07T20:23:13.7167917Z Running transaction test 2025-05-07T20:23:13.7463614Z Transaction test succeeded. 2025-05-07T20:23:13.7466488Z Running transaction 2025-05-07T20:23:14.2979563Z Preparing : 1/1 2025-05-07T20:23:14.4030757Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.4051349Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.4253274Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.4254505Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.4358596Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.4380149Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.5801523Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.5802115Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.5802687Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.5803227Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.7883713Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.7884098Z 2025-05-07T20:23:14.7884192Z Upgraded: 2025-05-07T20:23:14.7884550Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.7885135Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.7885474Z 2025-05-07T20:23:14.7885570Z Complete! 2025-05-07T20:23:14.8316142Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.8337529Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.2752854Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.2994508Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.3396062Z Dependencies resolved.
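The nvidia-container-toolkit upgrade above is what makes GPU_FLAG in this job's environment (--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all) meaningful: the toolkit is the runtime shim that lets Docker hand the A10G to a container. A sketch of how a later test step might consume it; the image tag is illustrative and not taken from this log:

# Smoke-test GPU passthrough the same way the eventual test container uses it.
# GPU_FLAG is left unquoted on purpose so it word-splits into separate args.
docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi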
2025-05-07T20:23:15.3574454Z ================================================================================ 2025-05-07T20:23:15.3575483Z Package Architecture Version Repository Size 2025-05-07T20:23:15.3575957Z ================================================================================ 2025-05-07T20:23:15.3576272Z Installing: 2025-05-07T20:23:15.3576572Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.3576851Z 2025-05-07T20:23:15.3576952Z Transaction Summary 2025-05-07T20:23:15.3577200Z ================================================================================ 2025-05-07T20:23:15.3577521Z Install 1 Package 2025-05-07T20:23:15.3577666Z 2025-05-07T20:23:15.3577772Z Total download size: 319 k 2025-05-07T20:23:15.3578035Z Installed size: 837 k 2025-05-07T20:23:15.3578793Z Downloading Packages: 2025-05-07T20:23:15.4276576Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.7 MB/s | 319 kB 00:00 2025-05-07T20:23:15.4282090Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.4284827Z Total 4.4 MB/s | 319 kB 00:00 2025-05-07T20:23:15.4441511Z Running transaction check 2025-05-07T20:23:15.4496618Z Transaction check succeeded. 2025-05-07T20:23:15.4497377Z Running transaction test 2025-05-07T20:23:15.4953241Z Transaction test succeeded. 2025-05-07T20:23:15.4956483Z Running transaction 2025-05-07T20:23:15.5974914Z Preparing : 1/1 2025-05-07T20:23:15.6476704Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.8293862Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:15.9859691Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.9860031Z 2025-05-07T20:23:15.9860128Z Installed: 2025-05-07T20:23:15.9860443Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.9860746Z 2025-05-07T20:23:15.9860831Z Complete! 2025-05-07T20:23:16.0300256Z + hostname 2025-05-07T20:23:16.0300964Z 2025-05-07T20:23:16.0313015Z ip-10-0-45-1.ec2.internal 2025-05-07T20:23:16.0314277Z 2025-05-07T20:23:16.0314555Z + sudo lshw -C display 2025-05-07T20:23:16.0314723Z 2025-05-07T20:23:16.5262134Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.5262676Z description: VGA compatible controller 2025-05-07T20:23:16.5263199Z product: Amazon.com, Inc. 2025-05-07T20:23:16.5263674Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.5264085Z physical id: 3 2025-05-07T20:23:16.5264454Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.5264862Z version: 00 2025-05-07T20:23:16.5265194Z width: 32 bits 2025-05-07T20:23:16.5265543Z clock: 33MHz 2025-05-07T20:23:16.5265919Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.5266419Z configuration: latency=0 2025-05-07T20:23:16.5266881Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.5267340Z *-display:1 2025-05-07T20:23:16.5267656Z description: 3D controller 2025-05-07T20:23:16.5268075Z product: GA102GL [A10G] 2025-05-07T20:23:16.5268475Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.5268864Z physical id: 1e 2025-05-07T20:23:16.5269205Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.5269576Z version: a1 2025-05-07T20:23:16.5269897Z width: 64 bits 2025-05-07T20:23:16.5270231Z clock: 33MHz 2025-05-07T20:23:16.5270655Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.5271156Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.5271998Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.5305376Z 2025-05-07T20:23:16.5305614Z ################################################################################ 2025-05-07T20:23:16.5305963Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.5436752Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.5611729Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.5612543Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.5613439Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.5613977Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.5614487Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.5615026Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.5615458Z | | | MIG M. | 2025-05-07T20:23:16.5615805Z |=========================================+========================+======================| 2025-05-07T20:23:16.5694305Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.5695141Z | 0% 33C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.5695530Z | | | N/A | 2025-05-07T20:23:16.5695937Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.5696347Z 2025-05-07T20:23:16.5696745Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.5697176Z | Processes: | 2025-05-07T20:23:16.5697623Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.5698041Z | ID ID Usage | 2025-05-07T20:23:16.5698404Z |=========================================================================================| 2025-05-07T20:23:16.5699039Z | No running processes found | 2025-05-07T20:23:16.5699518Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.7090005Z ################################################################################ 2025-05-07T20:23:16.7090382Z [INFO] Printing AMD GPU info ... 
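The checks that follow mirror the NVIDIA ones above: the script probes for ROCm tooling and reports its absence instead of failing, since this is a CUDA runner. The probe pattern reduces to something like this minimal sketch; the function name is hypothetical:

# Report which GPU stack is present; on this runner only nvidia-smi resolves,
# so the rocminfo/rocm-smi lookups below come back empty.
detect_gpu_stack () {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocminfo >/dev/null 2>&1 || command -v rocm-smi >/dev/null 2>&1; then
    echo rocm
  else
    echo cpu
  fi
}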
2025-05-07T20:23:16.7233045Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.7233806Z [CHECK] rocminfo not found 2025-05-07T20:23:16.7243283Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.7244290Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.7289689Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.7290128Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.7301708Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.7302074Z env: 2025-05-07T20:23:16.7302306Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.7302606Z BUILD_ENV: build_binary 2025-05-07T20:23:16.7302856Z BUILD_TARGET: genai 2025-05-07T20:23:16.7303089Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.7303324Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.7303591Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.7303905Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.7304231Z ##[endgroup] 2025-05-07T20:23:17.0640591Z ################################################################################ 2025-05-07T20:23:17.0641043Z # Setup Miniconda 2025-05-07T20:23:17.0641271Z # 2025-05-07T20:23:17.0657071Z # [2025-05-07T20:23:17.065Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.0657505Z ################################################################################ 2025-05-07T20:23:17.0657732Z 2025-05-07T20:23:17.0673594Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.1556148Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.1556530Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.1556738Z 2025-05-07T20:23:17.1574022Z 2025-05-07T20:23:17.1574429Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.1595159Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.2520840Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.2521226Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.2521482Z 2025-05-07T20:23:18.2666231Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.7180506Z Unpacking payload ... 2025-05-07T20:23:19.2355301Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.0341783Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.1371922Z 2025-05-07T20:23:22.1372302Z Installing base environment... 2025-05-07T20:23:22.1372536Z 2025-05-07T20:23:23.2092702Z Preparing transaction: ...working... done 2025-05-07T20:23:26.1758876Z Executing transaction: ...working... done 2025-05-07T20:23:26.8379664Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.9270312Z installation finished. 2025-05-07T20:23:26.9277087Z 2025-05-07T20:23:26.9277397Z + rm -f miniconda.sh 2025-05-07T20:23:26.9277588Z 2025-05-07T20:23:26.9583898Z 2025-05-07T20:23:26.9584289Z [SETUP] Reloading the bash configuration ... 
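Note: the step below runs conda init and then re-sources ~/.bashrc. The job's shell is bash --noprofile --norc, so the hook that defines the conda shell function is never loaded automatically. A minimal sketch of the same effect, using the conda.sh path shown in the output below:

    # Make `conda activate` usable in a non-interactive CI shell.
    # Equivalent to re-reading ~/.bashrc after `conda init bash`.
    source /home/ec2-user/miniconda/etc/profile.d/conda.sh
    conda activate base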
2025-05-07T20:23:26.9584657Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.3233210Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.3233783Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.3234286Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.3234827Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.3235281Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.3235688Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.3236133Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.3236572Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.3237036Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.3237847Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.3238375Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.3238745Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.3239142Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3910423Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.2291716Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.2316019Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.6993674Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.3069946Z Solving environment: done
2025-05-07T20:23:43.4039635Z ## Package Plan ##
2025-05-07T20:23:43.4040038Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.4040616Z   added / updated specs:
2025-05-07T20:23:43.4040894Z     - conda-libmamba-solver
2025-05-07T20:23:43.4041148Z     - libarchive
2025-05-07T20:23:43.4041381Z     - libmamba
2025-05-07T20:23:43.4041599Z     - libmambapy
2025-05-07T20:23:43.4041879Z The following packages will be downloaded:
2025-05-07T20:23:43.4042222Z     package                     |            build
2025-05-07T20:23:43.4042552Z     ----------------------------|-----------------
2025-05-07T20:23:43.4042977Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.4043451Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.4043894Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.4044376Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.4044825Z     ------------------------------------------------------------
2025-05-07T20:23:43.4045174Z                                            Total:         1.4 MB
2025-05-07T20:23:43.4045506Z The following packages will be UPDATED:
2025-05-07T20:23:43.4049893Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.4050691Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.4051320Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.4051965Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.4052770Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.4053415Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.4693338Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:23:43.4781676Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:23:43.4795952Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:23:43.6101924Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:23:43.6103629Z done
2025-05-07T20:23:43.7107091Z Preparing transaction: done
2025-05-07T20:23:43.8110898Z Verifying transaction: done
2025-05-07T20:23:45.2133699Z Executing transaction: done
2025-05-07T20:23:46.9785618Z [SETUP] Updating Miniconda base packages ...
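Note: the transaction above installs conda-libmamba-solver with --solver=classic, sidestepping a bootstrap problem — if the libmamba plugin is missing or broken, a solve that defaults to it cannot even install its replacement. Once it lands, later solves use libmamba, as the conda info output further down confirms. The two-phase pattern, as run above:

    # Phase 1: use the built-in classic solver to install the libmamba solver.
    conda install --solver=classic -c conda-forge --override-channels -y \
        conda-libmamba-solver libmamba libmambapy libarchive
    # Phase 2: from here on, plain `conda install ...` solves with libmamba.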
2025-05-07T20:23:46.9814547Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.9177302Z Channels:
2025-05-07T20:23:47.9177585Z  - defaults
2025-05-07T20:23:47.9177819Z Platform: linux-64
2025-05-07T20:23:49.1386550Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.2578868Z Solving environment: done
2025-05-07T20:23:49.2579298Z Channels:
2025-05-07T20:23:49.2579615Z  - defaults
2025-05-07T20:23:49.2579615Z Platform: linux-64
2025-05-07T20:23:49.5502043Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7607914Z Solving environment: done
2025-05-07T20:23:49.9116590Z ## Package Plan ##
2025-05-07T20:23:49.9116906Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9117261Z   added / updated specs:
2025-05-07T20:23:49.9117512Z     - conda
2025-05-07T20:23:49.9117764Z The following packages will be downloaded:
2025-05-07T20:23:49.9118112Z     package                     |            build
2025-05-07T20:23:49.9118432Z     ----------------------------|-----------------
2025-05-07T20:23:49.9118789Z     pip-25.1                    |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.9119178Z     tzdata-2025b                |       h04d1e81_0         116 KB
2025-05-07T20:23:49.9119551Z     ------------------------------------------------------------
2025-05-07T20:23:49.9119908Z                                            Total:         1.4 MB
2025-05-07T20:23:49.9121001Z The following packages will be UPDATED:
2025-05-07T20:23:49.9121521Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9122038Z   tzdata             2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9122444Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:49.9677646Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:23:50.1585906Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:23:50.1664677Z done
2025-05-07T20:23:50.2668866Z Preparing transaction: done
2025-05-07T20:23:50.3671715Z Verifying transaction: done
2025-05-07T20:23:52.3775088Z Executing transaction: done
2025-05-07T20:23:52.9892299Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:52.9897233Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.9911861Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:53.9912235Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.0600388Z + conda clean --all -y
2025-05-07T20:23:54.6097727Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.6098123Z Will remove 1 index cache(s).
2025-05-07T20:23:54.6098420Z There are no unused package(s) to remove.
2025-05-07T20:23:54.6098746Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.6099051Z There are no logfile(s) to remove. 2025-05-07T20:23:54.6824919Z 2025-05-07T20:23:54.6829599Z + conda info 2025-05-07T20:23:54.6829952Z 2025-05-07T20:23:55.4276801Z 2025-05-07T20:23:55.4277409Z active environment : base 2025-05-07T20:23:55.4277781Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.4278118Z shell level : 1 2025-05-07T20:23:55.4278405Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.4278795Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.4279171Z conda version : 25.3.1 2025-05-07T20:23:55.4279482Z conda-build version : not installed 2025-05-07T20:23:55.4279783Z python version : 3.13.2.final.0 2025-05-07T20:23:55.4280087Z solver : libmamba (default) 2025-05-07T20:23:55.4280415Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.4280726Z __conda=25.3.1=0 2025-05-07T20:23:55.4281015Z __cuda=12.8=0 2025-05-07T20:23:55.4281298Z __glibc=2.34=0 2025-05-07T20:23:55.4281582Z __linux=6.1.130=0 2025-05-07T20:23:55.4281857Z __unix=0=0 2025-05-07T20:23:55.4282197Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.4282606Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.4282950Z conda av metadata url : None 2025-05-07T20:23:55.4283325Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.4284149Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.4284548Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.4284922Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.4285303Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.4285650Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.4285990Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.4286332Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.4286641Z platform : linux-64 2025-05-07T20:23:55.4287466Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.4288279Z UID:GID : 1000:1000 2025-05-07T20:23:55.4288561Z netrc file : None 2025-05-07T20:23:55.4288833Z offline mode : False 2025-05-07T20:23:55.4289003Z 2025-05-07T20:23:55.4922965Z 2025-05-07T20:23:55.4923448Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.4924189Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_d0c2c61f-7224-4a35-a81e-b4e14e84e54d ... 2025-05-07T20:23:55.4925704Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.5078854Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9 2025-05-07T20:23:55.5079344Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.9 2025-05-07T20:23:55.5097134Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.5097493Z env: 2025-05-07T20:23:55.5097719Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.5098028Z BUILD_ENV: build_binary 2025-05-07T20:23:55.5098277Z BUILD_TARGET: genai 2025-05-07T20:23:55.5098508Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.5098742Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.5099002Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.5099310Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.5099643Z ##[endgroup] 2025-05-07T20:23:55.8465834Z ################################################################################ 2025-05-07T20:23:55.8466189Z # Create Conda Environment 2025-05-07T20:23:55.8466441Z # 2025-05-07T20:23:55.8481339Z # [2025-05-07T20:23:55.847Z] + create_conda_environment build_binary 3.9 2025-05-07T20:23:55.8481806Z ################################################################################ 2025-05-07T20:23:55.8491154Z 2025-05-07T20:23:55.8496149Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:55.9370489Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:55.9370863Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:55.9371192Z + conda info --envs 2025-05-07T20:23:55.9371332Z 2025-05-07T20:23:56.6833877Z 2025-05-07T20:23:56.6834617Z # conda environments: 2025-05-07T20:23:56.6834916Z # 2025-05-07T20:23:56.6835151Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.6835385Z 2025-05-07T20:23:56.7517908Z 2025-05-07T20:23:56.7518469Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.3886359Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.3886664Z 2025-05-07T20:23:58.3900035Z 2025-05-07T20:23:58.3909367Z [SETUP] Creating new Conda environment (Python 3.9) ... 
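Note: every `[EXEC] [ATTEMPT 0/3]` line comes from a retry wrapper that the prelude puts around network-bound commands. A minimal sketch of such a wrapper, assuming up to three retries and a fixed sleep — the helper name, delay, and exact behavior of the real setup_env.bash are assumptions here:

    # Hypothetical retry helper matching the [EXEC] [ATTEMPT i/3] log format.
    exec_with_retries () {
      local max_retries=3
      local i
      for i in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${i}/${max_retries}] + $*"
        "$@" && return 0
        sleep 10
      done
      return 1
    }

    exec_with_retries conda create -y -n build_binary python=3.9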
2025-05-07T20:23:58.3931716Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9 2025-05-07T20:23:59.1452066Z Channels: 2025-05-07T20:23:59.1452388Z - defaults 2025-05-07T20:23:59.1452683Z Platform: linux-64 2025-05-07T20:24:00.4747875Z Collecting package metadata (repodata.json): - \ | / - \ | / done 2025-05-07T20:24:00.5753615Z Solving environment: \ done 2025-05-07T20:24:00.6039067Z 2025-05-07T20:24:00.6039226Z ## Package Plan ## 2025-05-07T20:24:00.6039384Z 2025-05-07T20:24:00.6039594Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:00.6039907Z 2025-05-07T20:24:00.6040008Z added / updated specs: 2025-05-07T20:24:00.6040576Z - python=3.9 2025-05-07T20:24:00.6040827Z 2025-05-07T20:24:00.6040833Z 2025-05-07T20:24:00.6041001Z The following packages will be downloaded: 2025-05-07T20:24:00.6041288Z 2025-05-07T20:24:00.6041459Z package | build 2025-05-07T20:24:00.6041884Z ---------------------------|----------------- 2025-05-07T20:24:00.6042257Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:00.6042659Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:00.6043086Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:00.6043503Z python-3.9.21 | he870216_1 25.1 MB 2025-05-07T20:24:00.6043904Z setuptools-78.1.1 | py39h06a4308_0 1.7 MB 2025-05-07T20:24:00.6044303Z wheel-0.45.1 | py39h06a4308_0 114 KB 2025-05-07T20:24:00.6044674Z ------------------------------------------------------------ 2025-05-07T20:24:00.6045017Z Total: 27.1 MB 2025-05-07T20:24:00.6045563Z 2025-05-07T20:24:00.6045693Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:00.6045929Z 2025-05-07T20:24:00.6046319Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:00.6046779Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:00.6047326Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:00.6047888Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:00.6048353Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:00.6048789Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:00.6049234Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:00.6049694Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:00.6050151Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:00.6050644Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:00.6051230Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:00.6051807Z python pkgs/main/linux-64::python-3.9.21-he870216_1 2025-05-07T20:24:00.6052415Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:00.6053072Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0 2025-05-07T20:24:00.6053614Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:00.6054007Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:00.6054398Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:00.6054816Z wheel pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0 2025-05-07T20:24:00.6055201Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:00.6055584Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:00.6055826Z 2025-05-07T20:24:00.6055830Z 2025-05-07T20:24:00.6055842Z 2025-05-07T20:24:00.6055990Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:00.6489855Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:00.6619471Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:00.6844985Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:00.7435016Z setuptools-78.1.1    | 1.7 MB    | ########## | 100%
2025-05-07T20:24:00.8042349Z wheel-0.45.1         | 114 KB    | ########## | 100%
2025-05-07T20:24:01.1683970Z python-3.9.21        | 25.1 MB   | ########## | 100%
2025-05-07T20:24:01.6417089Z done
2025-05-07T20:24:01.8522798Z Preparing transaction: done
2025-05-07T20:24:02.9868051Z Verifying transaction: done
2025-05-07T20:24:05.2056223Z Executing transaction: done
2025-05-07T20:24:05.2561493Z #
2025-05-07T20:24:05.2561882Z # To activate this environment, use
2025-05-07T20:24:05.2562300Z #
2025-05-07T20:24:05.2562536Z #     $ conda activate build_binary
2025-05-07T20:24:05.2562821Z #
2025-05-07T20:24:05.2563051Z # To deactivate an active environment, use
2025-05-07T20:24:05.2563351Z #
2025-05-07T20:24:05.2563540Z #     $ conda deactivate
2025-05-07T20:24:05.3707639Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.3729693Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.1790268Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.1791202Z Collecting pip
2025-05-07T20:24:08.1791669Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.1792645Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.1793871Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 113.9 MB/s eta 0:00:00
2025-05-07T20:24:08.1794675Z Installing collected packages: pip
2025-05-07T20:24:08.1795137Z   Attempting uninstall: pip
2025-05-07T20:24:08.1795571Z     Found existing installation: pip 25.1
2025-05-07T20:24:08.1796024Z     Uninstalling pip-25.1:
2025-05-07T20:24:08.1796443Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:08.1796927Z Successfully installed pip-25.1.1
2025-05-07T20:24:08.2428509Z [SETUP] Upgrading pyOpenSSL ...
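Note: pip is upgraded via `conda run -n build_binary`, which executes a command inside the named environment without activating it in the caller's shell — convenient in CI, where each step starts from a fresh shell. The equivalent invocations, assuming the environment created above:

    # Run inside the env without activating it:
    conda run -n build_binary pip install --upgrade pip
    # Roughly equivalent to:
    #   conda activate build_binary && pip install --upgrade pip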
2025-05-07T20:24:08.2451088Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:09.0957627Z Channels: 2025-05-07T20:24:09.0957892Z - conda-forge 2025-05-07T20:24:09.0958136Z Platform: linux-64 2025-05-07T20:24:19.5247043Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:24:21.0403354Z Solving environment: / - \ | / done 2025-05-07T20:24:21.1020975Z 2025-05-07T20:24:21.1021277Z ## Package Plan ## 2025-05-07T20:24:21.1021490Z 2025-05-07T20:24:21.1021792Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:21.1022220Z 2025-05-07T20:24:21.1022336Z added / updated specs: 2025-05-07T20:24:21.1022619Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:21.1022826Z 2025-05-07T20:24:21.1022830Z 2025-05-07T20:24:21.1022955Z The following packages will be downloaded: 2025-05-07T20:24:21.1023174Z 2025-05-07T20:24:21.1023302Z package | build 2025-05-07T20:24:21.1023661Z ---------------------------|----------------- 2025-05-07T20:24:21.1024057Z cffi-1.17.1 | py39h15c3d72_0 236 KB conda-forge 2025-05-07T20:24:21.1024700Z cryptography-44.0.3 | py39h7170ec2_0 1.5 MB conda-forge 2025-05-07T20:24:21.1025159Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:21.1025584Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:21.1026021Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:21.1026440Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:21.1027005Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:21.1027548Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:21.1027991Z python_abi-3.9 | 2_cp39 4 KB conda-forge 2025-05-07T20:24:21.1028453Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:21.1028942Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:21.1029503Z ------------------------------------------------------------ 2025-05-07T20:24:21.1029872Z Total: 6.3 MB 2025-05-07T20:24:21.1030085Z 2025-05-07T20:24:21.1030215Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:21.1030446Z 2025-05-07T20:24:21.1030641Z cffi conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0 2025-05-07T20:24:21.1031139Z cryptography conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0 2025-05-07T20:24:21.1031637Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:21.1032108Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:21.1032647Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:21.1033120Z python_abi conda-forge/linux-64::python_abi-3.9-2_cp39 2025-05-07T20:24:21.1033639Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:21.1034224Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:21.1034812Z 2025-05-07T20:24:21.1034931Z The following packages will be UPDATED: 2025-05-07T20:24:21.1035148Z 2025-05-07T20:24:21.1035859Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:21.1036643Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:21.1037294Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:21.1037925Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> 
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.1038451Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:21.1815027Z cffi-1.17.1          | 236 KB    | ########## | 100%
2025-05-07T20:24:21.2036176Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:21.2110102Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:21.2438081Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:21.2800856Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:21.2864705Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:21.2884841Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:21.3054233Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:21.3161930Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:21.3689112Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:21.3783743Z python_abi-3.9       | 4 KB      | ########## | 100%
2025-05-07T20:24:21.5919825Z done
2025-05-07T20:24:21.6920631Z Preparing transaction: done
2025-05-07T20:24:21.7925542Z Verifying transaction: done
2025-05-07T20:24:23.2951806Z Executing transaction: done
2025-05-07T20:24:23.4752157Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.1967327Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:25.1979300Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.2002720Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.0644748Z Channels:
2025-05-07T20:24:26.0645021Z  - conda-forge
2025-05-07T20:24:26.0645309Z Platform: linux-64
2025-05-07T20:24:29.4120777Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.7807400Z Solving environment: done
2025-05-07T20:24:29.8415455Z ## Package Plan ##
2025-05-07T20:24:29.8415842Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:29.8416538Z   added / updated specs:
2025-05-07T20:24:29.8416802Z     - libxcrypt
2025-05-07T20:24:29.8417238Z The following packages will be downloaded:
2025-05-07T20:24:29.8417592Z     package                     |            build
2025-05-07T20:24:29.8417930Z     ----------------------------|-----------------
2025-05-07T20:24:29.8418323Z     libxcrypt-4.4.36            |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:29.8418737Z     ------------------------------------------------------------
2025-05-07T20:24:29.8419090Z                                            Total:          98 KB
2025-05-07T20:24:29.8419436Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:29.8419880Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:29.8420334Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.0010008Z libxcrypt-4.4.36 | 98 KB | | 0% 2025-05-07T20:24:30.0027946Z libxcrypt-4.4.36 | 98 KB | #6 | 16% 2025-05-07T20:24:30.0130109Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:30.0132509Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:30.0132875Z 2025-05-07T20:24:30.0133164Z done 2025-05-07T20:24:30.1138277Z Preparing transaction: / done 2025-05-07T20:24:30.2143365Z Verifying transaction: \ done 2025-05-07T20:24:30.3149672Z Executing transaction: / done 2025-05-07T20:24:33.7482165Z [SETUP] Copying over ... 2025-05-07T20:24:33.7482892Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h 2025-05-07T20:24:33.7483466Z 2025-05-07T20:24:33.7513035Z 2025-05-07T20:24:35.3939973Z [SETUP] Installed Python version: Python 3.9.21 2025-05-07T20:24:35.3940757Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:24:35.3972981Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:35.3973445Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:35.3986307Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:35.3986661Z env: 2025-05-07T20:24:35.3986889Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:35.3987192Z BUILD_ENV: build_binary 2025-05-07T20:24:35.3987441Z BUILD_TARGET: genai 2025-05-07T20:24:35.3987676Z BUILD_VARIANT: cuda 2025-05-07T20:24:35.3987924Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:35.3988219Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:35.3988538Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:35.3988878Z ##[endgroup] 2025-05-07T20:24:35.7370111Z ################################################################################ 2025-05-07T20:24:35.7370572Z # Install C/C++ Compilers 2025-05-07T20:24:35.7370834Z # 2025-05-07T20:24:35.7384955Z # [2025-05-07T20:24:35.738Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:24:35.7385483Z ################################################################################ 2025-05-07T20:24:35.7385712Z 2025-05-07T20:24:35.7400085Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:35.8267846Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:35.8278539Z [INSTALL] Installing GLIBC (architecture = 64) ... 
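Note: pinning sysroot_linux-64=2.17 fixes the toolchain's glibc baseline at 2.17 (the manylinux2014 floor), so artifacts built here can load on older distros even though the host glibc is 2.34, as the conda info virtual packages showed. A sketch of checking that no newer glibc symbols leak into a built library; the library name is a placeholder, and the check is illustrative rather than part of setup_env.bash:

    # Highest glibc symbol version referenced by a built artifact
    # (should be <= GLIBC_2.17 for a 2.17 sysroot; library name is hypothetical).
    objdump -T some_built_library.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1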
2025-05-07T20:24:35.8300099Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:36.6892927Z Channels:
2025-05-07T20:24:36.6893260Z  - conda-forge
2025-05-07T20:24:36.6893608Z Platform: linux-64
2025-05-07T20:24:40.0514183Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:40.4225716Z Solving environment: done
2025-05-07T20:24:40.4841202Z ## Package Plan ##
2025-05-07T20:24:40.4841793Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.4842404Z   added / updated specs:
2025-05-07T20:24:40.4842705Z     - sysroot_linux-64=2.17
2025-05-07T20:24:40.4843046Z The following packages will be downloaded:
2025-05-07T20:24:40.4843448Z     package                       |            build
2025-05-07T20:24:40.4843805Z     ------------------------------|-----------------
2025-05-07T20:24:40.4844442Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:40.4845220Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:40.4845851Z     ------------------------------------------------------------
2025-05-07T20:24:40.4846416Z                                            Total:        15.4 MB
2025-05-07T20:24:40.4846949Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.4847617Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.4848284Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.4848832Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:40.6864621Z kernel-headers_linux | 921 KB    | ########## | 100%
2025-05-07T20:24:40.8902935Z sysroot_linux-64-2.1 | 14.5 MB   | ########## | 100%
2025-05-07T20:24:41.3572111Z done
2025-05-07T20:24:41.4578674Z Preparing transaction: done
2025-05-07T20:24:41.6588864Z Verifying transaction: done
2025-05-07T20:24:41.8634389Z Executing transaction: done
2025-05-07T20:24:42.0224014Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:42.0224473Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:43.7176234Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:43.7191078Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
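Note: gxx_linux-64=11.4.0 pulls in conda-forge's prefixed GCC 11.4 toolchain (binaries such as x86_64-conda-linux-gnu-cc) rather than the system compiler, and the package's activation scripts export CC/CXX accordingly. A short sketch of pointing a build at it, assuming the env is activated — the exported variable names follow the conda-forge convention:

    # After activation, the conda-forge compiler packages export CC/CXX.
    conda activate build_binary
    echo "CC=${CC} CXX=${CXX}"   # e.g. .../x86_64-conda-linux-gnu-cc and -c++
    "${CXX}" --version           # expected to report g++ 11.4.0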
2025-05-07T20:24:43.7214991Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:44.6097775Z Channels: 2025-05-07T20:24:44.6098275Z - conda-forge 2025-05-07T20:24:44.6098751Z Platform: linux-64 2025-05-07T20:24:47.9774201Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:48.9442993Z Solving environment: \ | / - done 2025-05-07T20:24:49.0081965Z 2025-05-07T20:24:49.0082369Z ## Package Plan ## 2025-05-07T20:24:49.0082623Z 2025-05-07T20:24:49.0082996Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.0083886Z 2025-05-07T20:24:49.0084074Z added / updated specs: 2025-05-07T20:24:49.0084424Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.0084622Z 2025-05-07T20:24:49.0084638Z 2025-05-07T20:24:49.0084776Z The following packages will be downloaded: 2025-05-07T20:24:49.0084998Z 2025-05-07T20:24:49.0085123Z package | build 2025-05-07T20:24:49.0085463Z ---------------------------|----------------- 2025-05-07T20:24:49.0085883Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.0086373Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.0086858Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.0087327Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.0087789Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.0088242Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.0088689Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.0089173Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.0089662Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.0090114Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.0090603Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.0091094Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:49.0091506Z ------------------------------------------------------------ 2025-05-07T20:24:49.0091862Z Total: 91.6 MB 2025-05-07T20:24:49.0092086Z 2025-05-07T20:24:49.0092219Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.0092452Z 2025-05-07T20:24:49.0092730Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.0093484Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.0094037Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.0094549Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.0095067Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.0095571Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.0096109Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.0096681Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.0097189Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.0097737Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.0098109Z 2025-05-07T20:24:49.0098227Z The following packages will be UPDATED: 2025-05-07T20:24:49.0098448Z 2025-05-07T20:24:49.0098768Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> 
2025-05-07T20:24:49.0100048Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:51.4215184Z Preparing transaction: done
2025-05-07T20:24:51.7222398Z Verifying transaction: done
2025-05-07T20:24:51.8230974Z Executing transaction: done
2025-05-07T20:24:51.9877936Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:55.8957280Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:55.8986481Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.9019697Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:55.9050774Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:57.7900022Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.8527692Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:59.7376333Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.8031130Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.6866573Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:01.7519119Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.6405087Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.7059335Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.7063133Z [INFO] Printing out all preprocessor defines in the C compiler ...
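The dump that follows is produced by the GNU preprocessor: -E stops after preprocessing, -dM prints every macro defined at that point, and "-" reads the translation unit from stdin, so feeding it an empty input yields exactly the compiler's predefined macros. A minimal sketch for reproducing the check locally, assuming the same build_binary conda environment:

```bash
# Dump the C compiler's predefined macros: -E = preprocess only,
# -dM = print macro definitions, "-" = read the (empty) source from stdin.
conda run -n build_binary cc -dM -E - < /dev/null | sort

# Spot-check the toolchain version; with the gxx_linux-64=11.4.0 install
# above this should report "#define __GNUC__ 11".
conda run -n build_binary cc -dM -E - < /dev/null | grep '__GNUC__ '
```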
2025-05-07T20:25:03.7063817Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.7064057Z 2025-05-07T20:25:05.6080407Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6081062Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.6081700Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.6082180Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6082701Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.6083259Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.6083987Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6084898Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.6085282Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.6085955Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.6086646Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.6087513Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.6087915Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.6088844Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.6089834Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.6090433Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6090819Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.6091850Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.6092479Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.6092986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.6093726Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.6094395Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.6095262Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.6095736Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.6096077Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6096556Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.6096888Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6097274Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.6098090Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6098477Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.6098862Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6099323Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.6099682Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.6100017Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.6100451Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.6100834Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.6101282Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.6101699Z #define __INT8_C(c) c 2025-05-07T20:25:05.6102055Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.6102639Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6103130Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.6103567Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6104075Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6104429Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6104809Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6105250Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.6105622Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.6106117Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.6106698Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.6107086Z #define __linux 1 2025-05-07T20:25:05.6107414Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.6107908Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.6108315Z #define __unix 1 2025-05-07T20:25:05.6108602Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6109057Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6109467Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.6109779Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6110225Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6110634Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.6110966Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.6111399Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.6111806Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.6112202Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.6112607Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.6113025Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.6113388Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.6113858Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.6114343Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.6114716Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.6115073Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.6115421Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.6115853Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.6116341Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.6116766Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.6117218Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.6117760Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.6118075Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6118458Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.6118979Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.6119462Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6119839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6120244Z #define __unix__ 1 2025-05-07T20:25:05.6120586Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.6120892Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.6121279Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.6121650Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.6121986Z #define __UINT16_C(c) c 2025-05-07T20:25:05.6122369Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.6122751Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.6123184Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.6123719Z #define __gnu_linux__ 1 2025-05-07T20:25:05.6124062Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.6124424Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6124861Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6125232Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.6125574Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.6125992Z #define __GNUC__ 11 2025-05-07T20:25:05.6126306Z #define __pie__ 2 2025-05-07T20:25:05.6126605Z #define __MMX__ 1 2025-05-07T20:25:05.6126993Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.6127361Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.6127772Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.6128193Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.6128745Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6129248Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6129744Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6130073Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.6130441Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.6130914Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.6131243Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.6131613Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6132093Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6132474Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.6132858Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.6145006Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.6145314Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6145597Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.6145899Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.6146182Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.6146455Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.6146802Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6147187Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.6147478Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.6147745Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.6148072Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6148376Z #define __amd64 1 2025-05-07T20:25:05.6148622Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.6148906Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.6149230Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.6149549Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.6149819Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6150088Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.6150350Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.6150623Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.6150888Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.6151170Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.6151450Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.6151735Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.6152310Z #define __x86_64 1 2025-05-07T20:25:05.6152574Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.6152953Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.6153437Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.6153902Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.6154377Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6154762Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.6155028Z #define __LP64__ 1 2025-05-07T20:25:05.6155274Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6155632Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.6156023Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.6156319Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6156603Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6156911Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6157203Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.6157494Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.6157762Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.6158046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6158322Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6158654Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.6159027Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.6159312Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.6159547Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.6159800Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.6160135Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.6160486Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.6160918Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6161187Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.6161441Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.6161707Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.6161980Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6162284Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.6162567Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.6162845Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.6163156Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6163484Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.6163755Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6164024Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.6164264Z #define __INT32_C(c) c 2025-05-07T20:25:05.6164520Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.6164811Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.6165088Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.6165379Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.6165708Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.6166013Z #define unix 1 2025-05-07T20:25:05.6166257Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.6166584Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6166897Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.6167211Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.6167552Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.6167816Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.6168084Z #define __ELF__ 1 2025-05-07T20:25:05.6168321Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.6168611Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.6168886Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.6169145Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.6169514Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.6169879Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.6170143Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.6170381Z #define __k8 1 2025-05-07T20:25:05.6170677Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.6171154Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.6171460Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.6171768Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.6172028Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.6172280Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.6172545Z #define __x86_64__ 1 2025-05-07T20:25:05.6172783Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6173089Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.6173435Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6173743Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.6174038Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6174395Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6174710Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6174985Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.6175271Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6175573Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.6175944Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.6176344Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.6176641Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.6176977Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.6177310Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.6177664Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.6177944Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.6178259Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.6178544Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.6178781Z #define __SEG_FS 1 2025-05-07T20:25:05.6179016Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.6179296Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.6179572Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6179988Z #define __SEG_GS 1 2025-05-07T20:25:05.6180307Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.6180694Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.6180971Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.6181352Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.6181637Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.6181931Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.6182201Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.6182454Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.6182713Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.6183058Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6183449Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6183738Z #define linux 1 2025-05-07T20:25:05.6183970Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6184252Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6184540Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.6184789Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.6185055Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.6185326Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.6185671Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6186091Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.6186432Z #define __code_model_small__ 1 2025-05-07T20:25:05.6186706Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.6187000Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.6187256Z #define __k8__ 1 2025-05-07T20:25:05.6187487Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.6187785Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.6188095Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.6188338Z #define __pic__ 2 2025-05-07T20:25:05.6188600Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6188918Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.6189225Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6189554Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.6190025Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6190389Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.6190660Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.6190962Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6191279Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.6191531Z #define __linux__ 1 2025-05-07T20:25:05.6191764Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.6192034Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.6192297Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.6192577Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.6192839Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.6193134Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6193474Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.6193775Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.6194054Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.6194350Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.6194655Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.6194993Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6195352Z #define __SSE__ 1 2025-05-07T20:25:05.6195591Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6195937Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6196280Z #define __amd64__ 1 2025-05-07T20:25:05.6196509Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.6196768Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.6197038Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.6197324Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.6197639Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.6197929Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.6198187Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6198562Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6198841Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.6199193Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.6199672Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.6200035Z #define _LP64 1 2025-05-07T20:25:05.6200251Z #define __UINT8_C(c) c 2025-05-07T20:25:05.6200502Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.6200777Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.6201049Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.6201331Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.6201640Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.6202004Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6202464Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6202841Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6203142Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6203457Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.6203831Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.6204205Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.6204473Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.6204815Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.6205184Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6205445Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.6205701Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.6205963Z #define __FXSR__ 1 2025-05-07T20:25:05.6206269Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6206730Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6207148Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6207458Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.6207769Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.6208118Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.6208482Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.6208729Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.6209072Z #define __PIC__ 2 2025-05-07T20:25:05.6209341Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.6209741Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6210139Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.6210481Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6210809Z #define __SSE2__ 1 2025-05-07T20:25:05.6211049Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.6211309Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.6211569Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6211917Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.6212282Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.6212564Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.6212842Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.6213116Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6213391Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.6213641Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.6213893Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.6214187Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6214484Z #define __PIE__ 2 2025-05-07T20:25:05.6214811Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.6215204Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.6215549Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.6215921Z #define __INT16_C(c) c 2025-05-07T20:25:05.6216153Z #define __STDC__ 1 2025-05-07T20:25:05.6216383Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.6216666Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.6216926Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6217232Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.6217667Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.6218003Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.6218278Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6218562Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.6218831Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.6219122Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.6219409Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6219692Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6219997Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6220394Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6220773Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.6221177Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6221479Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.6221730Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.6221898Z 2025-05-07T20:25:05.6750256Z 2025-05-07T20:25:05.6751055Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
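The C++ pass below adds -x c++ so the driver treats the stdin input as C++ rather than C, which pulls in the C++-only macros (__cplusplus, __GNUG__, and the __cpp_* feature-test macros). A quick way to confirm the dialect, again assuming the build_binary environment:

```bash
# -x c++ forces the (empty) stdin input to be compiled as C++, so the
# dump includes C++-only predefined macros such as __cplusplus.
conda run -n build_binary c++ -dM -E -x c++ - < /dev/null \
  | grep -E '__cplusplus|__GNUG__ '
# On this GCC 11.4.0 toolchain the default dialect is gnu++17, so the
# dump shows "#define __cplusplus 201703L" (as in the output below).
```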
2025-05-07T20:25:05.6751524Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.6751789Z 2025-05-07T20:25:07.5747491Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5748071Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.5748529Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.5749024Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.5749422Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.5749781Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5750241Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.5750604Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.5750888Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.5751206Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.5751525Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.5751799Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.5752082Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.5752329Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.5752589Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.5753201Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.5753494Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.5753782Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.5754076Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.5754385Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5754698Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.5754990Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.5755329Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.5755663Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.5756070Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.5756488Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.5756814Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.5757106Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.5757356Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5757641Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.5757933Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.5758265Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5758569Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.5758897Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.5759207Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.5759547Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5759876Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.5760148Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.5760433Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.5760721Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.5761026Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.5761291Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.5761559Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.5762008Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.5762341Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.5762688Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.5762952Z #define __INT8_C(c) c 2025-05-07T20:25:07.5763190Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.5763473Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.5763802Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5764126Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.5764411Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.5764709Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.5765026Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.5765389Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5765680Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.5765965Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5766236Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5766526Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.5766812Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.5767213Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.5767633Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.5767931Z #define __linux 1 2025-05-07T20:25:07.5768163Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.5768453Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.5768739Z #define __unix 1 2025-05-07T20:25:07.5768967Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5769261Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.5769555Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.5769835Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.5770085Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5770381Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.5770663Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.5770931Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.5771195Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.5771485Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.5771785Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.5772156Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.5772465Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.5772743Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.5773053Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.5773341Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.5773605Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.5773963Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.5774346Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.5774608Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.5774889Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.5775171Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.5775409Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.5775711Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.5776068Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.5776310Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.5776557Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.5776900Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.5777255Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5777521Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.5777828Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.5778161Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.5778572Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.5778968Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5779248Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.5779517Z #define __unix__ 1 2025-05-07T20:25:07.5779741Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.5779990Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.5780242Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.5780582Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.5780855Z #define __UINT16_C(c) c 2025-05-07T20:25:07.5781250Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.5781555Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.5781922Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.5782293Z #define __gnu_linux__ 1 2025-05-07T20:25:07.5782533Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.5782801Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5783089Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5793030Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5793330Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.5793615Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.5793885Z #define __GNUC__ 11 2025-05-07T20:25:07.5794110Z #define __GXX_RTTI 1 2025-05-07T20:25:07.5794355Z #define __pie__ 2 2025-05-07T20:25:07.5794587Z #define __MMX__ 1 2025-05-07T20:25:07.5794816Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.5795101Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.5795408Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.5795694Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.5795957Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.5796287Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.5796628Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.5796985Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.5797369Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.5797682Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5798009Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.5798283Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.5798555Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.5798872Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.5799177Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.5799455Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.5799717Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5800014Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5800316Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.5800587Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.5801061Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.5801327Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.5801592Z #define __cplusplus 201703L 2025-05-07T20:25:07.5801868Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.5802159Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.5802417Z #define __DEPRECATED 1 2025-05-07T20:25:07.5802679Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.5802980Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.5803237Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.5803561Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.5803927Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.5804204Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.5804452Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.5804769Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5805069Z #define __amd64 1 2025-05-07T20:25:07.5805294Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.5805572Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.5805844Z #define __GNUG__ 11 2025-05-07T20:25:07.5806102Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.5806419Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.5806683Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.5806941Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:07.5807225Z [... several hundred `c++ -dM -E` predefined-macro lines elided: GCC 11.4.0 (__VERSION__ "11.4.0", GXX ABI 1016) targeting x86_64 GNU/Linux (__x86_64__, __linux__, __ELF__, _LP64), little-endian, with C++17-era feature-test macros ...]
2025-05-07T20:25:07.6414018Z + conda run -n build_binary c++ --version
2025-05-07T20:25:09.5253755Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:09.5254153Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:09.5254615Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:09.5255187Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:09.5884422Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:09.5885509Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:11.5495418Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:11.5498121Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:11.5498700Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:13.5105153Z #define __cplusplus 201703L
2025-05-07T20:25:13.5109905Z [INSTALL] Successfully installed C/C++ compilers
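The two probes above generalize to any toolchain. A minimal sketch replaying them, using the same commands the log shows (the build_binary env name comes from this job; 201710L and 201703L correspond to C17 and C++17):

```bash
# Sketch: replay the compiler-standard probes from the log above.
# Assumes a conda env named "build_binary" (as in this job) with cc/c++ installed.
set -euo pipefail

# Default C standard; 201710L means C17.
conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__

# Default C++ standard; -x c++ forces C++ mode on stdin, 201703L means C++17.
conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
```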
2025-05-07T20:25:13.5156071Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:13.5156505Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:13.5168362Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:13.5168710Z env:
2025-05-07T20:25:13.5168932Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:13.5169241Z BUILD_ENV: build_binary
2025-05-07T20:25:13.5169490Z BUILD_TARGET: genai
2025-05-07T20:25:13.5169717Z BUILD_VARIANT: cuda
2025-05-07T20:25:13.5169955Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:13.5170214Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:13.5170512Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:13.5170846Z ##[endgroup]
2025-05-07T20:25:13.8543353Z ################################################################################
2025-05-07T20:25:13.8543720Z # Install CUDA
2025-05-07T20:25:13.8543938Z #
2025-05-07T20:25:13.8559563Z # [2025-05-07T20:25:13.855Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:13.8559967Z ################################################################################
2025-05-07T20:25:13.8576330Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:13.9483334Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:13.9483780Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:13.9489108Z + conda clean --packages --tarball -y
2025-05-07T20:25:14.6575106Z Will remove 32 (140.4 MB) tarball(s).
2025-05-07T20:25:14.6575581Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:14.7213741Z + conda clean --all -y
2025-05-07T20:25:15.3890276Z There are no unused tarball(s) to remove.
2025-05-07T20:25:15.3890763Z Will remove 1 index cache(s).
2025-05-07T20:25:15.3891217Z There are no unused package(s) to remove.
2025-05-07T20:25:15.3891680Z There are no tempfile(s) to remove.
2025-05-07T20:25:15.3892119Z There are no logfile(s) to remove.
2025-05-07T20:25:15.4552510Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:15.4577857Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:16.3648231Z Channels:
2025-05-07T20:25:16.3648492Z - conda-forge
2025-05-07T20:25:16.3648735Z Platform: linux-64
2025-05-07T20:25:26.8556977Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:27.9561843Z Solving environment: done
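The [EXEC] [ATTEMPT 0/3] prefix on both the wget probe and the conda install suggests setup_env.bash wraps flaky commands in a retry loop. A hypothetical sketch of that pattern (the real logic lives in .github/scripts/setup_env.bash and may differ; the function name exec_with_retries and the 10s backoff are assumptions):

```bash
# Hypothetical sketch of the retry pattern behind the "[EXEC] [ATTEMPT n/3]"
# log lines; not the actual implementation from setup_env.bash.
exec_with_retries () {
  local max_attempts=3 attempt
  for attempt in $(seq 0 $((max_attempts - 1))); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0            # command succeeded; stop retrying
    fi
    sleep 10              # brief backoff before the next attempt
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Usage matching the install step above:
exec_with_retries conda install --force-reinstall -n build_binary \
  -c conda-forge --override-channels -y cuda=12.6.3
```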
2025-05-07T20:25:28.0291887Z ## Package Plan ##
2025-05-07T20:25:28.0292393Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.0293005Z added / updated specs:
2025-05-07T20:25:28.0293354Z - cuda=12.6.3
2025-05-07T20:25:28.0293777Z The following packages will be downloaded:
2025-05-07T20:25:28.0294420Z package | build
2025-05-07T20:25:28.0294888Z ---------------------------|-----------------
2025-05-07T20:25:28.0298837Z cuda-12.6.3 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:28.0308997Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:28.0310866Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB conda-forge
2025-05-07T20:25:28.0311825Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge
2025-05-07T20:25:28.0316450Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge
2025-05-07T20:25:28.0326892Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge
2025-05-07T20:25:28.0329319Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge
2025-05-07T20:25:28.0330226Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge
2025-05-07T20:25:28.0332047Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge
2025-05-07T20:25:28.0332992Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge
2025-05-07T20:25:28.0333934Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge
2025-05-07T20:25:28.0338959Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge
2025-05-07T20:25:28.0348655Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge
2025-05-07T20:25:28.0351691Z python-3.9.18 |h0755675_1_cpython 22.7 MB conda-forge
[... roughly 100 smaller conda-forge packages (remaining cuda-* components, dev packages, X11/xorg libraries, fonts, compression and support libraries) elided ...]
2025-05-07T20:25:28.0364026Z ------------------------------------------------------------
2025-05-07T20:25:28.0364380Z Total: 1.63 GB
2025-05-07T20:25:28.0364728Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:28.0367355Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0
2025-05-07T20:25:28.0393262Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0
2025-05-07T20:25:28.0394333Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3
2025-05-07T20:25:28.0401060Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:25:28.0402315Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
[... remaining NEW packages (the full CUDA 12.6 component set, cuBLAS/cuFFT/cuRAND/cuSOLVER/cuSPARSE/NPP libraries, and their X11/xorg, font, and support dependencies) elided ...]
2025-05-07T20:25:28.0451034Z The following packages will be UPDATED:
2025-05-07T20:25:28.0451480Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:28.0452041Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:28.0452646Z python pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython
2025-05-07T20:25:28.0453273Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:28.0453857Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:28.0454521Z Downloading and Extracting Packages: ...working...
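The UPDATED/SUPERSEDED entries above follow from solving against conda-forge only (-c conda-forge --override-channels): packages that originally came from pkgs/main (python, sqlite, tk) get rebuilt from conda-forge, even at a lower version. A hedged sketch of recreating this environment from conda-forge from the start, which avoids the supersedence churn (env name and versions mirror this job; the nvcc check is an assumed verification step not shown in this log):

```bash
# Sketch: build the env from conda-forge only, so python/sqlite/tk never come
# from pkgs/main and nothing needs to be superseded later.
conda create -y -n build_binary -c conda-forge --override-channels python=3.9
conda install -y -n build_binary -c conda-forge --override-channels cuda=12.6.3

# cuda-nvcc is part of the cuda metapackage, so nvcc should resolve in the env
# (assumed verification step, not shown in this log).
conda run -n build_binary nvcc --version
```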
[... interleaved, animated download progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages; the captured log ends mid-download ...]
| 443.1 MB | ##1 | 21% 2025-05-07T20:25:30.6290827Z 2025-05-07T20:25:30.6695500Z libcublas-12.6.4.1 | 256.2 MB | ###2 | 33%  2025-05-07T20:25:30.6695798Z 2025-05-07T20:25:30.6695802Z 2025-05-07T20:25:30.6695806Z 2025-05-07T20:25:30.6696147Z 2025-05-07T20:25:30.6822130Z cuda-nsight-12.6.77 | 113.2 MB | #######5 | 76%  2025-05-07T20:25:30.6822428Z 2025-05-07T20:25:30.6822692Z 2025-05-07T20:25:30.6822917Z 2025-05-07T20:25:30.7184937Z libcusparse-12.5.4.2 | 118.6 MB | #######1 | 71%  2025-05-07T20:25:30.7185242Z 2025-05-07T20:25:30.7185247Z 2025-05-07T20:25:30.7283338Z libcufft-11.3.0.4 | 156.2 MB | #####5 | 55%  2025-05-07T20:25:30.7293395Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:30.7294310Z 2025-05-07T20:25:30.7697608Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:30.7697887Z 2025-05-07T20:25:30.7697893Z 2025-05-07T20:25:30.7697899Z 2025-05-07T20:25:30.7700098Z 2025-05-07T20:25:30.7887442Z cuda-nsight-12.6.77 | 113.2 MB | #######8 | 79%  2025-05-07T20:25:30.7888097Z 2025-05-07T20:25:30.7888103Z 2025-05-07T20:25:30.7888511Z 2025-05-07T20:25:30.8190751Z libcusparse-12.5.4.2 | 118.6 MB | #######4 | 74%  2025-05-07T20:25:30.8191068Z 2025-05-07T20:25:30.8191072Z 2025-05-07T20:25:30.8285242Z libcufft-11.3.0.4 | 156.2 MB | #####7 | 57%  2025-05-07T20:25:30.8293983Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:30.8296687Z 2025-05-07T20:25:30.8697803Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:30.8698220Z 2025-05-07T20:25:30.8698226Z 2025-05-07T20:25:30.8698232Z 2025-05-07T20:25:30.8699093Z 2025-05-07T20:25:30.8889935Z cuda-nsight-12.6.77 | 113.2 MB | ########2 | 83%  2025-05-07T20:25:30.8890276Z 2025-05-07T20:25:30.8890280Z 2025-05-07T20:25:30.8892753Z 2025-05-07T20:25:30.9192781Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 77%  2025-05-07T20:25:30.9193077Z 2025-05-07T20:25:30.9193688Z 2025-05-07T20:25:30.9334638Z libcufft-11.3.0.4 | 156.2 MB | #####9 | 59%  2025-05-07T20:25:30.9350983Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:30.9351404Z 2025-05-07T20:25:30.9698922Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:30.9699343Z 2025-05-07T20:25:30.9699349Z 2025-05-07T20:25:30.9699382Z 2025-05-07T20:25:30.9701561Z 2025-05-07T20:25:30.9891612Z cuda-nsight-12.6.77 | 113.2 MB | ########6 | 86%  2025-05-07T20:25:30.9891981Z 2025-05-07T20:25:30.9892536Z 2025-05-07T20:25:30.9894033Z 2025-05-07T20:25:31.0193316Z libcusparse-12.5.4.2 | 118.6 MB | ######## | 81%  2025-05-07T20:25:31.0193623Z 2025-05-07T20:25:31.0194227Z 2025-05-07T20:25:31.0338049Z libcufft-11.3.0.4 | 156.2 MB | ######1 | 62%  2025-05-07T20:25:31.0700766Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:25:31.0701041Z 2025-05-07T20:25:31.0701045Z 2025-05-07T20:25:31.0701050Z 2025-05-07T20:25:31.0701766Z 2025-05-07T20:25:31.0749194Z cuda-nsight-12.6.77 | 113.2 MB | ########9 | 90%  2025-05-07T20:25:31.0750231Z 2025-05-07T20:25:31.0894078Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:31.0894415Z 2025-05-07T20:25:31.0894421Z 2025-05-07T20:25:31.0896311Z 2025-05-07T20:25:31.1198850Z libcusparse-12.5.4.2 | 118.6 MB | ########4 | 84%  2025-05-07T20:25:31.1199229Z 2025-05-07T20:25:31.1199236Z 2025-05-07T20:25:31.1339472Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 64%  2025-05-07T20:25:31.1733837Z nsight-compute-2024. 
| 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.1734343Z 2025-05-07T20:25:31.1734349Z 2025-05-07T20:25:31.1734354Z 2025-05-07T20:25:31.1735081Z 2025-05-07T20:25:31.1750837Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 93%  2025-05-07T20:25:31.1751130Z 2025-05-07T20:25:31.1920123Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.1920418Z 2025-05-07T20:25:31.1920424Z 2025-05-07T20:25:31.1922824Z 2025-05-07T20:25:31.2239315Z libcusparse-12.5.4.2 | 118.6 MB | ########7 | 87%  2025-05-07T20:25:31.2239704Z 2025-05-07T20:25:31.2241332Z 2025-05-07T20:25:31.2376409Z libcufft-11.3.0.4 | 156.2 MB | ######6 | 66%  2025-05-07T20:25:31.2735198Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:31.2735995Z 2025-05-07T20:25:31.2736004Z 2025-05-07T20:25:31.2736010Z 2025-05-07T20:25:31.2736974Z 2025-05-07T20:25:31.2751701Z cuda-nsight-12.6.77 | 113.2 MB | #########6 | 97%  2025-05-07T20:25:31.2752696Z 2025-05-07T20:25:31.2952597Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:31.2952963Z 2025-05-07T20:25:31.2952967Z 2025-05-07T20:25:31.2953639Z 2025-05-07T20:25:31.3283904Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 90%  2025-05-07T20:25:31.3284339Z 2025-05-07T20:25:31.3284345Z 2025-05-07T20:25:31.3455102Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 69%  2025-05-07T20:25:31.3752014Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:31.3752875Z 2025-05-07T20:25:31.3953072Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:31.3953359Z 2025-05-07T20:25:31.3953366Z 2025-05-07T20:25:31.3955329Z 2025-05-07T20:25:31.4457872Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:31.4755450Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:31.4758268Z 2025-05-07T20:25:31.4956506Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:25:31.4956841Z 2025-05-07T20:25:31.4956847Z 2025-05-07T20:25:31.4956852Z 2025-05-07T20:25:31.5067273Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:31.5067623Z 2025-05-07T20:25:31.5067629Z 2025-05-07T20:25:31.5521599Z libcufft-11.3.0.4 | 156.2 MB | ####### | 71%  2025-05-07T20:25:31.5867449Z nsight-compute-2024. | 443.1 MB | ### | 30% 2025-05-07T20:25:31.5868848Z 2025-05-07T20:25:31.6071370Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 47%  2025-05-07T20:25:31.6071693Z 2025-05-07T20:25:31.6071698Z 2025-05-07T20:25:31.6548938Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 73%  2025-05-07T20:25:31.6867624Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:31.6868836Z 2025-05-07T20:25:31.7074400Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:31.7074952Z 2025-05-07T20:25:31.7074957Z 2025-05-07T20:25:31.7550363Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:31.7891412Z nsight-compute-2024. | 443.1 MB | ###2 | 32% 2025-05-07T20:25:31.7891824Z 2025-05-07T20:25:31.8126436Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 51%  2025-05-07T20:25:31.8126779Z 2025-05-07T20:25:31.8129601Z 2025-05-07T20:25:31.8551579Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 78%  2025-05-07T20:25:31.9137995Z nsight-compute-2024. | 443.1 MB | ###3 | 33% 2025-05-07T20:25:31.9138269Z 2025-05-07T20:25:31.9138273Z 2025-05-07T20:25:31.9435246Z libcufft-11.3.0.4 | 156.2 MB | ######## | 81%  2025-05-07T20:25:31.9436564Z 2025-05-07T20:25:31.9552439Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 53%  2025-05-07T20:25:32.0437365Z nsight-compute-2024. 
| 443.1 MB | ###4 | 34% 2025-05-07T20:25:32.0437832Z 2025-05-07T20:25:32.0488444Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:32.0488769Z 2025-05-07T20:25:32.0488773Z 2025-05-07T20:25:32.0561371Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 83%  2025-05-07T20:25:32.1438884Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:25:32.1442294Z 2025-05-07T20:25:32.1563432Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 56%  2025-05-07T20:25:32.1721599Z nsight-compute-2024. | 443.1 MB | ###6 | 37% 2025-05-07T20:25:32.1721877Z 2025-05-07T20:25:32.1723953Z 2025-05-07T20:25:32.2440511Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 85%  2025-05-07T20:25:32.2441582Z 2025-05-07T20:25:32.2567323Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:32.3218550Z nsight-compute-2024. | 443.1 MB | ###7 | 38% 2025-05-07T20:25:32.3218880Z 2025-05-07T20:25:32.3219604Z 2025-05-07T20:25:32.3444284Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 87%  2025-05-07T20:25:32.3445639Z 2025-05-07T20:25:32.3569428Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:25:32.4220956Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:32.4221294Z 2025-05-07T20:25:32.4222745Z 2025-05-07T20:25:32.4503068Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:32.4504858Z 2025-05-07T20:25:32.4799735Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:25:32.5220988Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:32.5221361Z 2025-05-07T20:25:32.5222665Z 2025-05-07T20:25:32.5562117Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 92%  2025-05-07T20:25:32.5564688Z 2025-05-07T20:25:32.5811609Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:25:32.6564392Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:25:32.6565086Z 2025-05-07T20:25:32.6645203Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:32.6645488Z 2025-05-07T20:25:32.6649977Z 2025-05-07T20:25:32.6813155Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 94%  2025-05-07T20:25:32.7587727Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:32.7588785Z 2025-05-07T20:25:32.7645646Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 67%  2025-05-07T20:25:32.7645955Z 2025-05-07T20:25:32.7645961Z 2025-05-07T20:25:32.8105710Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:25:32.8629797Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:32.8630148Z 2025-05-07T20:25:32.8646817Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:32.8647124Z 2025-05-07T20:25:32.8647130Z 2025-05-07T20:25:32.9212754Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 98%  2025-05-07T20:25:32.9631900Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:32.9632267Z 2025-05-07T20:25:33.0221264Z libcublas-12.6.4.1 | 256.2 MB | ####### | 71%  2025-05-07T20:25:33.0696811Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:33.0697976Z 2025-05-07T20:25:33.1237006Z libcublas-12.6.4.1 | 256.2 MB | #######2 | 73%  2025-05-07T20:25:33.1709768Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:33.1711377Z 2025-05-07T20:25:33.2257739Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 74%  2025-05-07T20:25:33.2731522Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:33.2732984Z 2025-05-07T20:25:33.3258790Z libcublas-12.6.4.1 | 256.2 MB | #######6 | 76%  2025-05-07T20:25:33.3762272Z nsight-compute-2024. 
| 443.1 MB | ####9 | 49% 2025-05-07T20:25:33.3762647Z 2025-05-07T20:25:33.4191089Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:33.4191469Z 2025-05-07T20:25:33.4191476Z 2025-05-07T20:25:33.4191482Z 2025-05-07T20:25:33.4191488Z 2025-05-07T20:25:33.4265055Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:33.4832029Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:25:33.4834256Z 2025-05-07T20:25:33.5020803Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 80%  2025-05-07T20:25:33.5021287Z 2025-05-07T20:25:33.5021304Z 2025-05-07T20:25:33.5021309Z 2025-05-07T20:25:33.5021314Z 2025-05-07T20:25:33.5021319Z 2025-05-07T20:25:33.5370991Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:33.6021526Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:33.6021883Z 2025-05-07T20:25:33.6022418Z 2025-05-07T20:25:33.6022428Z 2025-05-07T20:25:33.6022435Z 2025-05-07T20:25:33.6022622Z 2025-05-07T20:25:33.6056106Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:33.6056476Z 2025-05-07T20:25:33.6056482Z 2025-05-07T20:25:33.6056492Z 2025-05-07T20:25:33.6340819Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:33.6347241Z 2025-05-07T20:25:33.6615473Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:33.6615839Z 2025-05-07T20:25:33.6615851Z 2025-05-07T20:25:33.6615857Z 2025-05-07T20:25:33.6615863Z 2025-05-07T20:25:33.6615868Z 2025-05-07T20:25:33.6617159Z 2025-05-07T20:25:33.6650730Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:33.7023862Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:33.7024226Z 2025-05-07T20:25:33.7024232Z 2025-05-07T20:25:33.7024238Z 2025-05-07T20:25:33.7024243Z 2025-05-07T20:25:33.7025642Z 2025-05-07T20:25:33.7618000Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 5%  2025-05-07T20:25:33.7618344Z 2025-05-07T20:25:33.7618348Z 2025-05-07T20:25:33.7618352Z 2025-05-07T20:25:33.7618357Z 2025-05-07T20:25:33.7618360Z 2025-05-07T20:25:33.7620199Z 2025-05-07T20:25:33.7905870Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:33.7983878Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:33.7989492Z 2025-05-07T20:25:33.8026783Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:33.8027137Z 2025-05-07T20:25:33.8027143Z 2025-05-07T20:25:33.8027149Z 2025-05-07T20:25:33.8027154Z 2025-05-07T20:25:33.8027160Z 2025-05-07T20:25:33.8622425Z cuda-nvvp-12.6.80 | 109.3 MB | 7 | 8%  2025-05-07T20:25:33.8622794Z 2025-05-07T20:25:33.8622804Z 2025-05-07T20:25:33.8622808Z 2025-05-07T20:25:33.8622812Z 2025-05-07T20:25:33.8622815Z 2025-05-07T20:25:33.8623997Z 2025-05-07T20:25:33.9034775Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:33.9035145Z 2025-05-07T20:25:33.9035149Z 2025-05-07T20:25:33.9035153Z 2025-05-07T20:25:33.9035157Z 2025-05-07T20:25:33.9038235Z 2025-05-07T20:25:33.9196832Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:25:33.9328790Z nsight-compute-2024. 
| 443.1 MB | #####3 | 54% 2025-05-07T20:25:33.9329220Z 2025-05-07T20:25:33.9630122Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 84%  2025-05-07T20:25:33.9630730Z 2025-05-07T20:25:33.9630735Z 2025-05-07T20:25:33.9630739Z 2025-05-07T20:25:33.9630742Z 2025-05-07T20:25:33.9630746Z 2025-05-07T20:25:33.9630750Z 2025-05-07T20:25:34.0037843Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:25:34.0038154Z 2025-05-07T20:25:34.0038158Z 2025-05-07T20:25:34.0038162Z 2025-05-07T20:25:34.0038165Z 2025-05-07T20:25:34.0046125Z 2025-05-07T20:25:34.0437647Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 13%  2025-05-07T20:25:34.0589745Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:34.0592004Z 2025-05-07T20:25:34.0638531Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:25:34.0638796Z 2025-05-07T20:25:34.0638800Z 2025-05-07T20:25:34.0638817Z 2025-05-07T20:25:34.0638820Z 2025-05-07T20:25:34.0638824Z 2025-05-07T20:25:34.0641667Z 2025-05-07T20:25:34.1094012Z libcusolver-11.7.1.2 | 95.8 MB | # | 11%  2025-05-07T20:25:34.1094372Z 2025-05-07T20:25:34.1094385Z 2025-05-07T20:25:34.1094389Z 2025-05-07T20:25:34.1094393Z 2025-05-07T20:25:34.1096867Z 2025-05-07T20:25:34.1487739Z cuda-nvvp-12.6.80 | 109.3 MB | #4 | 15%  2025-05-07T20:25:34.1646106Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:25:34.1646481Z 2025-05-07T20:25:34.1646486Z 2025-05-07T20:25:34.1646492Z 2025-05-07T20:25:34.1646497Z 2025-05-07T20:25:34.1646502Z 2025-05-07T20:25:34.1646508Z 2025-05-07T20:25:34.1671243Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:34.1671870Z 2025-05-07T20:25:34.2097900Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:34.2098288Z 2025-05-07T20:25:34.2098294Z 2025-05-07T20:25:34.2098300Z 2025-05-07T20:25:34.2098305Z 2025-05-07T20:25:34.2100291Z 2025-05-07T20:25:34.2577112Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 17%  2025-05-07T20:25:34.2675442Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:34.2675809Z 2025-05-07T20:25:34.2675816Z 2025-05-07T20:25:34.2675821Z 2025-05-07T20:25:34.2675826Z 2025-05-07T20:25:34.2675831Z 2025-05-07T20:25:34.2675836Z 2025-05-07T20:25:34.2728970Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 16%  2025-05-07T20:25:34.2732536Z 2025-05-07T20:25:34.3099866Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:25:34.3100225Z 2025-05-07T20:25:34.3100231Z 2025-05-07T20:25:34.3100238Z 2025-05-07T20:25:34.3100244Z 2025-05-07T20:25:34.3101597Z 2025-05-07T20:25:34.3624339Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 20%  2025-05-07T20:25:34.3679799Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:25:34.3680076Z 2025-05-07T20:25:34.3680082Z 2025-05-07T20:25:34.3680087Z 2025-05-07T20:25:34.3680092Z 2025-05-07T20:25:34.3680099Z 2025-05-07T20:25:34.3683086Z 2025-05-07T20:25:34.3843366Z libcusolver-11.7.1.2 | 95.8 MB | #8 | 19%  2025-05-07T20:25:34.3846268Z 2025-05-07T20:25:34.4184538Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:25:34.4184861Z 2025-05-07T20:25:34.4184866Z 2025-05-07T20:25:34.4184869Z 2025-05-07T20:25:34.4184873Z 2025-05-07T20:25:34.4188820Z 2025-05-07T20:25:34.4683444Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 22%  2025-05-07T20:25:34.4683773Z 2025-05-07T20:25:34.4683776Z 2025-05-07T20:25:34.4683780Z 2025-05-07T20:25:34.4683784Z 2025-05-07T20:25:34.4683788Z 2025-05-07T20:25:34.4683792Z 2025-05-07T20:25:34.4716196Z libcusolver-11.7.1.2 | 95.8 MB | ##1 | 22%  2025-05-07T20:25:34.4870873Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:34.4871263Z 2025-05-07T20:25:34.5225454Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:25:34.5225820Z 2025-05-07T20:25:34.5225824Z 2025-05-07T20:25:34.5225828Z 2025-05-07T20:25:34.5225854Z 2025-05-07T20:25:34.5227118Z 2025-05-07T20:25:34.5688415Z cuda-nvvp-12.6.80 | 109.3 MB | ##4 | 25%  2025-05-07T20:25:34.5688791Z 2025-05-07T20:25:34.5688797Z 2025-05-07T20:25:34.5688802Z 2025-05-07T20:25:34.5688807Z 2025-05-07T20:25:34.5688812Z 2025-05-07T20:25:34.5688818Z 2025-05-07T20:25:34.5968737Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:25:34.5972710Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:34.5979161Z 2025-05-07T20:25:34.6231125Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 91%  2025-05-07T20:25:34.6231398Z 2025-05-07T20:25:34.6231402Z 2025-05-07T20:25:34.6231406Z 2025-05-07T20:25:34.6231420Z 2025-05-07T20:25:34.6238872Z 2025-05-07T20:25:34.6692044Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 27%  2025-05-07T20:25:34.6692633Z 2025-05-07T20:25:34.6692639Z 2025-05-07T20:25:34.6692643Z 2025-05-07T20:25:34.6692648Z 2025-05-07T20:25:34.6692652Z 2025-05-07T20:25:34.6692679Z 2025-05-07T20:25:34.6975286Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:25:34.6975632Z 2025-05-07T20:25:34.7088338Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:34.7235353Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:34.7235711Z 2025-05-07T20:25:34.7235715Z 2025-05-07T20:25:34.7235719Z 2025-05-07T20:25:34.7235723Z 2025-05-07T20:25:34.7242963Z 2025-05-07T20:25:34.7727162Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 30%  2025-05-07T20:25:34.7727849Z 2025-05-07T20:25:34.7727855Z 2025-05-07T20:25:34.7727861Z 2025-05-07T20:25:34.7727866Z 2025-05-07T20:25:34.7727871Z 2025-05-07T20:25:34.7728814Z 2025-05-07T20:25:34.8037509Z libcusolver-11.7.1.2 | 95.8 MB | ### | 31%  2025-05-07T20:25:34.8039450Z 2025-05-07T20:25:34.8091727Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:34.8238156Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:34.8238461Z 2025-05-07T20:25:34.8238465Z 2025-05-07T20:25:34.8238469Z 2025-05-07T20:25:34.8238473Z 2025-05-07T20:25:34.8239099Z 2025-05-07T20:25:34.8812061Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 32%  2025-05-07T20:25:34.8812407Z 2025-05-07T20:25:34.8812411Z 2025-05-07T20:25:34.8812415Z 2025-05-07T20:25:34.8812419Z 2025-05-07T20:25:34.8812430Z 2025-05-07T20:25:34.8816015Z 2025-05-07T20:25:34.9040808Z libcusolver-11.7.1.2 | 95.8 MB | ###3 | 34%  2025-05-07T20:25:34.9042557Z 2025-05-07T20:25:34.9091606Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:34.9365187Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:34.9365560Z 2025-05-07T20:25:34.9365566Z 2025-05-07T20:25:34.9365571Z 2025-05-07T20:25:34.9365576Z 2025-05-07T20:25:34.9368761Z 2025-05-07T20:25:34.9871806Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 35%  2025-05-07T20:25:34.9872133Z 2025-05-07T20:25:34.9872150Z 2025-05-07T20:25:34.9872154Z 2025-05-07T20:25:34.9872158Z 2025-05-07T20:25:34.9872162Z 2025-05-07T20:25:34.9872166Z 2025-05-07T20:25:35.0044200Z libcusolver-11.7.1.2 | 95.8 MB | ###6 | 36%  2025-05-07T20:25:35.0044498Z 2025-05-07T20:25:35.0095175Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.0369167Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:35.0369441Z 2025-05-07T20:25:35.0369445Z 2025-05-07T20:25:35.0369449Z 2025-05-07T20:25:35.0369457Z 2025-05-07T20:25:35.0376432Z 2025-05-07T20:25:35.0871869Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 37%  2025-05-07T20:25:35.0872294Z 2025-05-07T20:25:35.0872300Z 2025-05-07T20:25:35.0872305Z 2025-05-07T20:25:35.0872310Z 2025-05-07T20:25:35.0872316Z 2025-05-07T20:25:35.0873539Z 2025-05-07T20:25:35.1066484Z libcusolver-11.7.1.2 | 95.8 MB | ###9 | 39%  2025-05-07T20:25:35.1066835Z 2025-05-07T20:25:35.1181495Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1370668Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.1371081Z 2025-05-07T20:25:35.1371087Z 2025-05-07T20:25:35.1371093Z 2025-05-07T20:25:35.1371098Z 2025-05-07T20:25:35.1372577Z 2025-05-07T20:25:35.1872808Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 40%  2025-05-07T20:25:35.1873150Z 2025-05-07T20:25:35.1873156Z 2025-05-07T20:25:35.1873161Z 2025-05-07T20:25:35.1873167Z 2025-05-07T20:25:35.1873173Z 2025-05-07T20:25:35.1878252Z 2025-05-07T20:25:35.2074776Z libcusolver-11.7.1.2 | 95.8 MB | ####2 | 42%  2025-05-07T20:25:35.2075082Z 2025-05-07T20:25:35.2183977Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:35.2371004Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:35.2371441Z 2025-05-07T20:25:35.2371532Z 2025-05-07T20:25:35.2371538Z 2025-05-07T20:25:35.2371543Z 2025-05-07T20:25:35.2371667Z 2025-05-07T20:25:35.2872804Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:25:35.2873095Z 2025-05-07T20:25:35.2873099Z 2025-05-07T20:25:35.2873103Z 2025-05-07T20:25:35.2873107Z 2025-05-07T20:25:35.2873110Z 2025-05-07T20:25:35.2878600Z 2025-05-07T20:25:35.3141017Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 46%  2025-05-07T20:25:35.3142519Z 2025-05-07T20:25:35.3184414Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 99%  2025-05-07T20:25:35.3399897Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:35.3400178Z 2025-05-07T20:25:35.3400183Z 2025-05-07T20:25:35.3400186Z 2025-05-07T20:25:35.3400190Z 2025-05-07T20:25:35.3400194Z 2025-05-07T20:25:35.3873794Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:35.3874252Z 2025-05-07T20:25:35.3874259Z 2025-05-07T20:25:35.3874264Z 2025-05-07T20:25:35.3874270Z 2025-05-07T20:25:35.3874276Z 2025-05-07T20:25:35.3877371Z 2025-05-07T20:25:35.4188303Z libcusolver-11.7.1.2 | 95.8 MB | ####9 | 49%  2025-05-07T20:25:35.4403216Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:35.4403487Z 2025-05-07T20:25:35.4403560Z 2025-05-07T20:25:35.4403566Z 2025-05-07T20:25:35.4403570Z 2025-05-07T20:25:35.4408589Z 2025-05-07T20:25:35.4879043Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 48%  2025-05-07T20:25:35.4879374Z 2025-05-07T20:25:35.4879382Z 2025-05-07T20:25:35.4879387Z 2025-05-07T20:25:35.4879392Z 2025-05-07T20:25:35.4879398Z 2025-05-07T20:25:35.4882482Z 2025-05-07T20:25:35.5215296Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 53%  2025-05-07T20:25:35.5406121Z nsight-compute-2024. 
| 443.1 MB | ######5 | 65% 2025-05-07T20:25:35.5406487Z 2025-05-07T20:25:35.5406493Z 2025-05-07T20:25:35.5406498Z 2025-05-07T20:25:35.5406503Z 2025-05-07T20:25:35.5409013Z 2025-05-07T20:25:35.5895575Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:25:35.5895967Z 2025-05-07T20:25:35.5895973Z 2025-05-07T20:25:35.5895979Z 2025-05-07T20:25:35.5895984Z 2025-05-07T20:25:35.5895990Z 2025-05-07T20:25:35.5897485Z 2025-05-07T20:25:35.6251676Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 56%  2025-05-07T20:25:35.6411899Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:35.6412163Z 2025-05-07T20:25:35.6412168Z 2025-05-07T20:25:35.6412172Z 2025-05-07T20:25:35.6412175Z 2025-05-07T20:25:35.6415758Z 2025-05-07T20:25:35.6897288Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 54%  2025-05-07T20:25:35.6897616Z 2025-05-07T20:25:35.6897622Z 2025-05-07T20:25:35.6897627Z 2025-05-07T20:25:35.6897632Z 2025-05-07T20:25:35.6897640Z 2025-05-07T20:25:35.6899187Z 2025-05-07T20:25:35.7253720Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:35.7417266Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:35.7417548Z 2025-05-07T20:25:35.7417897Z 2025-05-07T20:25:35.7417902Z 2025-05-07T20:25:35.7417906Z 2025-05-07T20:25:35.7420335Z 2025-05-07T20:25:35.7948075Z cuda-nvvp-12.6.80 | 109.3 MB | #####7 | 57%  2025-05-07T20:25:35.7948480Z 2025-05-07T20:25:35.7948485Z 2025-05-07T20:25:35.7948491Z 2025-05-07T20:25:35.7948496Z 2025-05-07T20:25:35.7948501Z 2025-05-07T20:25:35.7948506Z 2025-05-07T20:25:35.8282215Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 62%  2025-05-07T20:25:35.8469328Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:35.8469702Z 2025-05-07T20:25:35.8469708Z 2025-05-07T20:25:35.8469713Z 2025-05-07T20:25:35.8469718Z 2025-05-07T20:25:35.8469724Z 2025-05-07T20:25:35.8950636Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 60%  2025-05-07T20:25:35.8951046Z 2025-05-07T20:25:35.8951052Z 2025-05-07T20:25:35.8951058Z 2025-05-07T20:25:35.8951063Z 2025-05-07T20:25:35.8951068Z 2025-05-07T20:25:35.8951073Z 2025-05-07T20:25:35.9314062Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:35.9474955Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:35.9475443Z 2025-05-07T20:25:35.9475449Z 2025-05-07T20:25:35.9475455Z 2025-05-07T20:25:35.9475460Z 2025-05-07T20:25:35.9475466Z 2025-05-07T20:25:35.9961194Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 63%  2025-05-07T20:25:35.9961627Z 2025-05-07T20:25:35.9961633Z 2025-05-07T20:25:35.9961639Z 2025-05-07T20:25:35.9961645Z 2025-05-07T20:25:35.9961650Z 2025-05-07T20:25:35.9962768Z 2025-05-07T20:25:36.0316188Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 69%  2025-05-07T20:25:36.0476411Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.0476680Z 2025-05-07T20:25:36.0476685Z 2025-05-07T20:25:36.0476689Z 2025-05-07T20:25:36.0476693Z 2025-05-07T20:25:36.0480365Z 2025-05-07T20:25:36.0975150Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 66%  2025-05-07T20:25:36.0976008Z 2025-05-07T20:25:36.0976017Z 2025-05-07T20:25:36.0976023Z 2025-05-07T20:25:36.0976028Z 2025-05-07T20:25:36.0976033Z 2025-05-07T20:25:36.0976038Z 2025-05-07T20:25:36.1317261Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:36.1572677Z nsight-compute-2024. 
| 443.1 MB | ######9 | 70% 2025-05-07T20:25:36.1572956Z 2025-05-07T20:25:36.1572961Z 2025-05-07T20:25:36.1572965Z 2025-05-07T20:25:36.1572968Z 2025-05-07T20:25:36.1575022Z 2025-05-07T20:25:36.1980941Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 69%  2025-05-07T20:25:36.1981423Z 2025-05-07T20:25:36.1981431Z 2025-05-07T20:25:36.1981436Z 2025-05-07T20:25:36.1981442Z 2025-05-07T20:25:36.1981447Z 2025-05-07T20:25:36.1981452Z 2025-05-07T20:25:36.2318646Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 76%  2025-05-07T20:25:36.2607537Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:36.2607819Z 2025-05-07T20:25:36.2607854Z 2025-05-07T20:25:36.2607877Z 2025-05-07T20:25:36.2607882Z 2025-05-07T20:25:36.2610822Z 2025-05-07T20:25:36.3001305Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:25:36.3001733Z 2025-05-07T20:25:36.3001739Z 2025-05-07T20:25:36.3001744Z 2025-05-07T20:25:36.3001749Z 2025-05-07T20:25:36.3001756Z 2025-05-07T20:25:36.3003947Z 2025-05-07T20:25:36.3457666Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:36.3607661Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:25:36.3608028Z 2025-05-07T20:25:36.3608034Z 2025-05-07T20:25:36.3608039Z 2025-05-07T20:25:36.3608056Z 2025-05-07T20:25:36.3610034Z 2025-05-07T20:25:36.4006126Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:25:36.4006424Z 2025-05-07T20:25:36.4006428Z 2025-05-07T20:25:36.4006441Z 2025-05-07T20:25:36.4006446Z 2025-05-07T20:25:36.4006450Z 2025-05-07T20:25:36.4008058Z 2025-05-07T20:25:36.4584153Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:25:36.4608966Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:36.4609326Z 2025-05-07T20:25:36.4609332Z 2025-05-07T20:25:36.4609337Z 2025-05-07T20:25:36.4609342Z 2025-05-07T20:25:36.4610548Z 2025-05-07T20:25:36.5051464Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 77%  2025-05-07T20:25:36.5051860Z 2025-05-07T20:25:36.5051865Z 2025-05-07T20:25:36.5051871Z 2025-05-07T20:25:36.5051876Z 2025-05-07T20:25:36.5051891Z 2025-05-07T20:25:36.5054727Z 2025-05-07T20:25:36.5584369Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:36.5622027Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:36.5622391Z 2025-05-07T20:25:36.5622397Z 2025-05-07T20:25:36.5622402Z 2025-05-07T20:25:36.5622407Z 2025-05-07T20:25:36.5622412Z 2025-05-07T20:25:36.6059008Z cuda-nvvp-12.6.80 | 109.3 MB | ######## | 80%  2025-05-07T20:25:36.6059414Z 2025-05-07T20:25:36.6059433Z 2025-05-07T20:25:36.6059438Z 2025-05-07T20:25:36.6059444Z 2025-05-07T20:25:36.6059449Z 2025-05-07T20:25:36.6062846Z 2025-05-07T20:25:36.6636461Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 89%  2025-05-07T20:25:36.6636865Z 2025-05-07T20:25:36.6636870Z 2025-05-07T20:25:36.6636876Z 2025-05-07T20:25:36.6636881Z 2025-05-07T20:25:36.6642292Z 2025-05-07T20:25:36.6673138Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 83%  2025-05-07T20:25:36.7069111Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:36.7069478Z 2025-05-07T20:25:36.7069483Z 2025-05-07T20:25:36.7069489Z 2025-05-07T20:25:36.7069498Z 2025-05-07T20:25:36.7069504Z 2025-05-07T20:25:36.7071089Z 2025-05-07T20:25:36.7637349Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 92%  2025-05-07T20:25:36.7637780Z 2025-05-07T20:25:36.7637785Z 2025-05-07T20:25:36.7637791Z 2025-05-07T20:25:36.7637797Z 2025-05-07T20:25:36.7639977Z 2025-05-07T20:25:36.7756793Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 86%  2025-05-07T20:25:36.8071558Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:36.8071929Z 2025-05-07T20:25:36.8071935Z 2025-05-07T20:25:36.8071941Z 2025-05-07T20:25:36.8071946Z 2025-05-07T20:25:36.8071951Z 2025-05-07T20:25:36.8071957Z 2025-05-07T20:25:36.8646790Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 96%  2025-05-07T20:25:36.8647206Z 2025-05-07T20:25:36.8647212Z 2025-05-07T20:25:36.8647217Z 2025-05-07T20:25:36.8647222Z 2025-05-07T20:25:36.8647228Z 2025-05-07T20:25:36.8750096Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:36.8750577Z 2025-05-07T20:25:36.8750582Z 2025-05-07T20:25:36.8750587Z 2025-05-07T20:25:36.8753360Z 2025-05-07T20:25:36.8833160Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:36.9075579Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:36.9075966Z 2025-05-07T20:25:36.9075984Z 2025-05-07T20:25:36.9075994Z 2025-05-07T20:25:36.9075999Z 2025-05-07T20:25:36.9076005Z 2025-05-07T20:25:36.9076035Z 2025-05-07T20:25:36.9649572Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 99%  2025-05-07T20:25:36.9649895Z 2025-05-07T20:25:36.9649900Z 2025-05-07T20:25:36.9649904Z 2025-05-07T20:25:36.9649907Z 2025-05-07T20:25:36.9649911Z 2025-05-07T20:25:36.9836232Z cuda-nvvp-12.6.80 | 109.3 MB | #########2 | 92%  2025-05-07T20:25:37.0650622Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:37.0651014Z 2025-05-07T20:25:37.0651020Z 2025-05-07T20:25:37.0651025Z 2025-05-07T20:25:37.0651042Z 2025-05-07T20:25:37.0651048Z 2025-05-07T20:25:37.0839122Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 96%  2025-05-07T20:25:37.1658254Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:37.1658516Z 2025-05-07T20:25:37.1658648Z 2025-05-07T20:25:37.1658652Z 2025-05-07T20:25:37.1658921Z 2025-05-07T20:25:37.1662452Z 2025-05-07T20:25:37.1840736Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 100%  2025-05-07T20:25:37.2841831Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:37.3845397Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:25:37.4848172Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:37.5319416Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:37.5319746Z 2025-05-07T20:25:37.5323163Z 2025-05-07T20:25:37.5849654Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:37.5878770Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:37.5879028Z 2025-05-07T20:25:37.5879032Z 2025-05-07T20:25:37.5879035Z 2025-05-07T20:25:37.5879039Z 2025-05-07T20:25:37.5879043Z 2025-05-07T20:25:37.5879047Z 2025-05-07T20:25:37.5879051Z 2025-05-07T20:25:37.6883180Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:37.6883483Z 2025-05-07T20:25:37.6883487Z 2025-05-07T20:25:37.6883491Z 2025-05-07T20:25:37.6883494Z 2025-05-07T20:25:37.6883498Z 2025-05-07T20:25:37.6883502Z 2025-05-07T20:25:37.6883630Z 2025-05-07T20:25:37.7005099Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:25:37.7883619Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:37.7883983Z 2025-05-07T20:25:37.7883988Z 2025-05-07T20:25:37.7883994Z 2025-05-07T20:25:37.7883999Z 2025-05-07T20:25:37.7884005Z 2025-05-07T20:25:37.7884010Z 2025-05-07T20:25:37.7887501Z 2025-05-07T20:25:37.8269126Z libnpp-12.3.1.54 | 93.4 MB | 7 | 7%  2025-05-07T20:25:37.8885302Z nsight-compute-2024. 
| 443.1 MB | ########3 | 83% 2025-05-07T20:25:37.8885663Z 2025-05-07T20:25:37.8885668Z 2025-05-07T20:25:37.8885674Z 2025-05-07T20:25:37.8885679Z 2025-05-07T20:25:37.8885685Z 2025-05-07T20:25:37.8885690Z 2025-05-07T20:25:37.8889963Z 2025-05-07T20:25:37.9315599Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:37.9886333Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:37.9886695Z 2025-05-07T20:25:37.9886700Z 2025-05-07T20:25:37.9886706Z 2025-05-07T20:25:37.9886711Z 2025-05-07T20:25:37.9886716Z 2025-05-07T20:25:37.9886721Z 2025-05-07T20:25:37.9888618Z 2025-05-07T20:25:38.0475406Z libnpp-12.3.1.54 | 93.4 MB | #4 | 15%  2025-05-07T20:25:38.0887312Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:38.0887643Z 2025-05-07T20:25:38.0887649Z 2025-05-07T20:25:38.0887654Z 2025-05-07T20:25:38.0887659Z 2025-05-07T20:25:38.0887665Z 2025-05-07T20:25:38.0887670Z 2025-05-07T20:25:38.0889264Z 2025-05-07T20:25:38.1548290Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:25:38.1933643Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:38.1933967Z 2025-05-07T20:25:38.1933984Z 2025-05-07T20:25:38.1933998Z 2025-05-07T20:25:38.1934003Z 2025-05-07T20:25:38.1934008Z 2025-05-07T20:25:38.1934013Z 2025-05-07T20:25:38.1934219Z 2025-05-07T20:25:38.2929284Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 22%  2025-05-07T20:25:38.2937542Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:38.2937872Z 2025-05-07T20:25:38.2937877Z 2025-05-07T20:25:38.2937882Z 2025-05-07T20:25:38.2937888Z 2025-05-07T20:25:38.2937893Z 2025-05-07T20:25:38.2937907Z 2025-05-07T20:25:38.2940494Z 2025-05-07T20:25:38.3939867Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:25:38.3940452Z 2025-05-07T20:25:38.3940466Z 2025-05-07T20:25:38.3940472Z 2025-05-07T20:25:38.3940478Z 2025-05-07T20:25:38.3940483Z 2025-05-07T20:25:38.3940489Z 2025-05-07T20:25:38.3945415Z 2025-05-07T20:25:38.4133129Z libnpp-12.3.1.54 | 93.4 MB | ### | 31%  2025-05-07T20:25:38.4940264Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:38.4940832Z 2025-05-07T20:25:38.4940836Z 2025-05-07T20:25:38.4940839Z 2025-05-07T20:25:38.4940843Z 2025-05-07T20:25:38.4940854Z 2025-05-07T20:25:38.4940858Z 2025-05-07T20:25:38.4942645Z 2025-05-07T20:25:38.5136265Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:25:38.6018416Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:25:38.6018771Z 2025-05-07T20:25:38.6018777Z 2025-05-07T20:25:38.6018783Z 2025-05-07T20:25:38.6018788Z 2025-05-07T20:25:38.6018793Z 2025-05-07T20:25:38.6018799Z 2025-05-07T20:25:38.6020549Z 2025-05-07T20:25:38.6140603Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 39%  2025-05-07T20:25:38.7126616Z nsight-compute-2024. | 443.1 MB | ########9 | 89% 2025-05-07T20:25:38.7126873Z 2025-05-07T20:25:38.7126877Z 2025-05-07T20:25:38.7126880Z 2025-05-07T20:25:38.7126884Z 2025-05-07T20:25:38.7126887Z 2025-05-07T20:25:38.7126891Z 2025-05-07T20:25:38.7128807Z 2025-05-07T20:25:38.7140966Z libnpp-12.3.1.54 | 93.4 MB | ####2 | 43%  2025-05-07T20:25:38.8132733Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:38.8133001Z 2025-05-07T20:25:38.8133005Z 2025-05-07T20:25:38.8133009Z 2025-05-07T20:25:38.8133013Z 2025-05-07T20:25:38.8133017Z 2025-05-07T20:25:38.8133020Z 2025-05-07T20:25:38.8134779Z 2025-05-07T20:25:38.8142213Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 47%  2025-05-07T20:25:38.9142906Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:38.9145390Z nsight-compute-2024. 
| 443.1 MB | #########1 | 92% 2025-05-07T20:25:38.9145702Z 2025-05-07T20:25:38.9145707Z 2025-05-07T20:25:38.9145712Z 2025-05-07T20:25:38.9145717Z 2025-05-07T20:25:38.9145723Z 2025-05-07T20:25:38.9145728Z 2025-05-07T20:25:38.9145733Z 2025-05-07T20:25:39.0145132Z libnpp-12.3.1.54 | 93.4 MB | ##### | 50%  2025-05-07T20:25:39.0187202Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:39.0187539Z 2025-05-07T20:25:39.0187543Z 2025-05-07T20:25:39.0187547Z 2025-05-07T20:25:39.0187551Z 2025-05-07T20:25:39.0187554Z 2025-05-07T20:25:39.0187558Z 2025-05-07T20:25:39.0187562Z 2025-05-07T20:25:39.1145711Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 54%  2025-05-07T20:25:39.1239579Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:39.1239932Z 2025-05-07T20:25:39.1239990Z 2025-05-07T20:25:39.1240027Z 2025-05-07T20:25:39.1240033Z 2025-05-07T20:25:39.1240038Z 2025-05-07T20:25:39.1240043Z 2025-05-07T20:25:39.1240445Z 2025-05-07T20:25:39.2154015Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 58%  2025-05-07T20:25:39.2246156Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:39.2246536Z 2025-05-07T20:25:39.2246542Z 2025-05-07T20:25:39.2246548Z 2025-05-07T20:25:39.2246553Z 2025-05-07T20:25:39.2246559Z 2025-05-07T20:25:39.2246586Z 2025-05-07T20:25:39.2249073Z 2025-05-07T20:25:39.3188479Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 62%  2025-05-07T20:25:39.4191050Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:39.4293812Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:39.4294167Z 2025-05-07T20:25:39.4294171Z 2025-05-07T20:25:39.4294185Z 2025-05-07T20:25:39.4294189Z 2025-05-07T20:25:39.4294193Z 2025-05-07T20:25:39.4294196Z 2025-05-07T20:25:39.4297442Z 2025-05-07T20:25:39.5213131Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 65%  2025-05-07T20:25:39.5295179Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:39.5295428Z 2025-05-07T20:25:39.5295432Z 2025-05-07T20:25:39.5295436Z 2025-05-07T20:25:39.5295440Z 2025-05-07T20:25:39.5295443Z 2025-05-07T20:25:39.5295455Z 2025-05-07T20:25:39.5297090Z 2025-05-07T20:25:39.6248297Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:25:39.6299202Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:39.6299748Z 2025-05-07T20:25:39.6299753Z 2025-05-07T20:25:39.6299756Z 2025-05-07T20:25:39.6299760Z 2025-05-07T20:25:39.6299764Z 2025-05-07T20:25:39.6299767Z 2025-05-07T20:25:39.6300707Z 2025-05-07T20:25:39.7300796Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 73%  2025-05-07T20:25:39.7301235Z 2025-05-07T20:25:39.7301239Z 2025-05-07T20:25:39.7301243Z 2025-05-07T20:25:39.7301246Z 2025-05-07T20:25:39.7301258Z 2025-05-07T20:25:39.7301261Z 2025-05-07T20:25:39.7302206Z 2025-05-07T20:25:39.7304754Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:25:39.7941050Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:25:39.7941555Z 2025-05-07T20:25:39.7941560Z 2025-05-07T20:25:39.7941564Z 2025-05-07T20:25:39.7941567Z 2025-05-07T20:25:39.7941571Z 2025-05-07T20:25:39.7951644Z 2025-05-07T20:25:39.8303764Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:39.8304195Z 2025-05-07T20:25:39.8304201Z 2025-05-07T20:25:39.8304206Z 2025-05-07T20:25:39.8304211Z 2025-05-07T20:25:39.8304217Z 2025-05-07T20:25:39.8304222Z 2025-05-07T20:25:39.8305257Z 2025-05-07T20:25:39.8387221Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:25:39.8701397Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:25:39.8701771Z 2025-05-07T20:25:39.8701777Z 2025-05-07T20:25:39.8701782Z 2025-05-07T20:25:39.8701787Z 2025-05-07T20:25:39.8701792Z 2025-05-07T20:25:39.8701798Z 2025-05-07T20:25:39.8701803Z 2025-05-07T20:25:39.8702142Z 2025-05-07T20:25:39.9408657Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:39.9409101Z 2025-05-07T20:25:39.9409108Z 2025-05-07T20:25:39.9409113Z 2025-05-07T20:25:39.9409118Z 2025-05-07T20:25:39.9409124Z 2025-05-07T20:25:39.9409130Z 2025-05-07T20:25:39.9410208Z 2025-05-07T20:25:39.9707307Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:25:39.9707752Z 2025-05-07T20:25:39.9707758Z 2025-05-07T20:25:39.9707764Z 2025-05-07T20:25:39.9707769Z 2025-05-07T20:25:39.9707775Z 2025-05-07T20:25:39.9707780Z 2025-05-07T20:25:39.9707785Z 2025-05-07T20:25:39.9707791Z 2025-05-07T20:25:40.0552692Z cuda-nvdisasm-12.6.7 | 47.6 MB | 5 | 6%  2025-05-07T20:25:40.0553096Z 2025-05-07T20:25:40.0553100Z 2025-05-07T20:25:40.0553104Z 2025-05-07T20:25:40.0553108Z 2025-05-07T20:25:40.0553111Z 2025-05-07T20:25:40.0553115Z 2025-05-07T20:25:40.0566117Z 2025-05-07T20:25:40.0710907Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:25:40.0711295Z 2025-05-07T20:25:40.0711301Z 2025-05-07T20:25:40.0711306Z 2025-05-07T20:25:40.0711312Z 2025-05-07T20:25:40.0711316Z 2025-05-07T20:25:40.0711322Z 2025-05-07T20:25:40.0711337Z 2025-05-07T20:25:40.0715923Z 2025-05-07T20:25:40.1579915Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 12%  2025-05-07T20:25:40.1580363Z 2025-05-07T20:25:40.1580369Z 2025-05-07T20:25:40.1580387Z 2025-05-07T20:25:40.1580393Z 2025-05-07T20:25:40.1580398Z 2025-05-07T20:25:40.1580403Z 2025-05-07T20:25:40.1582443Z 2025-05-07T20:25:40.1711749Z libnpp-12.3.1.54 | 93.4 MB | ######### | 91%  2025-05-07T20:25:40.1712164Z 2025-05-07T20:25:40.1712170Z 2025-05-07T20:25:40.1712175Z 2025-05-07T20:25:40.1712181Z 2025-05-07T20:25:40.1712186Z 2025-05-07T20:25:40.1712191Z 2025-05-07T20:25:40.1712196Z 2025-05-07T20:25:40.1712202Z 2025-05-07T20:25:40.2611460Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:40.2611893Z 2025-05-07T20:25:40.2611898Z 2025-05-07T20:25:40.2611904Z 2025-05-07T20:25:40.2611909Z 2025-05-07T20:25:40.2611928Z 2025-05-07T20:25:40.2611933Z 2025-05-07T20:25:40.2615474Z 2025-05-07T20:25:40.2743142Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 94%  2025-05-07T20:25:40.2743788Z 2025-05-07T20:25:40.2743793Z 2025-05-07T20:25:40.2743799Z 2025-05-07T20:25:40.2743804Z 2025-05-07T20:25:40.2743809Z 2025-05-07T20:25:40.2743814Z 2025-05-07T20:25:40.2743820Z 2025-05-07T20:25:40.2743830Z 2025-05-07T20:25:40.3615119Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 26%  2025-05-07T20:25:40.3615558Z 2025-05-07T20:25:40.3615564Z 2025-05-07T20:25:40.3615569Z 2025-05-07T20:25:40.3615574Z 2025-05-07T20:25:40.3615580Z 2025-05-07T20:25:40.3615585Z 2025-05-07T20:25:40.3621556Z 2025-05-07T20:25:40.3818623Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:25:40.3819023Z 2025-05-07T20:25:40.3819028Z 2025-05-07T20:25:40.3819034Z 2025-05-07T20:25:40.3819039Z 2025-05-07T20:25:40.3819044Z 2025-05-07T20:25:40.3819050Z 2025-05-07T20:25:40.3819055Z 2025-05-07T20:25:40.3820407Z 2025-05-07T20:25:40.4823163Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 32%  2025-05-07T20:25:40.4823544Z 2025-05-07T20:25:40.4823551Z 2025-05-07T20:25:40.4823556Z 2025-05-07T20:25:40.4823561Z 2025-05-07T20:25:40.4823566Z 2025-05-07T20:25:40.4823572Z 2025-05-07T20:25:40.4823577Z 2025-05-07T20:25:40.4823582Z 2025-05-07T20:25:40.5484843Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###9 | 39%  
2025-05-07T20:25:40.6097538Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:25:43.0118019Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:43.0685777Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:25:43.2686418Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:25:43.4792919Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:25:43.5210608Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:25:44.9129241Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:25:44.9328870Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:25:45.2518860Z python-3.9.18        | 22.7 MB  | ########## | 100%
2025-05-07T20:25:45.7238408Z libcufft-11.3.0.4    | 156.2 MB | ########## | 100%
2025-05-07T20:25:45.7739155Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:25:45.9356744Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:25:45.9588226Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:25:45.9739314Z ... (more hidden) ...
2025-05-07T20:25:46.1598570Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:25:46.4772288Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:25:46.8224647Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:25:47.0376764Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:25:49.0337484Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:57.0770510Z 2025-05-07T20:25:57.0770523Z 2025-05-07T20:25:57.0770529Z 2025-05-07T20:25:57.0770533Z 2025-05-07T20:25:57.0770539Z 2025-05-07T20:25:57.0770722Z  2025-05-07T20:25:57.0770962Z 2025-05-07T20:25:57.0770967Z 2025-05-07T20:25:57.0770972Z 2025-05-07T20:25:57.0770986Z 2025-05-07T20:25:57.0770991Z 2025-05-07T20:25:57.0771099Z 2025-05-07T20:25:57.0771105Z 2025-05-07T20:25:57.0771110Z 2025-05-07T20:25:57.0771115Z 2025-05-07T20:25:57.0771120Z 2025-05-07T20:25:57.0771126Z 2025-05-07T20:25:57.0771131Z 2025-05-07T20:25:57.0771317Z  2025-05-07T20:25:57.0771575Z 2025-05-07T20:25:57.0771580Z 2025-05-07T20:25:57.0771585Z 2025-05-07T20:25:57.0771590Z 2025-05-07T20:25:57.0771596Z 2025-05-07T20:25:57.0771601Z 2025-05-07T20:25:57.0771606Z 2025-05-07T20:25:57.0771611Z 2025-05-07T20:25:57.0771616Z 2025-05-07T20:25:57.0771621Z 2025-05-07T20:25:57.0771626Z 2025-05-07T20:25:57.0771631Z 2025-05-07T20:25:57.0771637Z 2025-05-07T20:25:57.0771824Z  2025-05-07T20:25:57.0772095Z 2025-05-07T20:25:57.0772100Z 2025-05-07T20:25:57.0772105Z 2025-05-07T20:25:57.0772110Z 2025-05-07T20:25:57.0772116Z 2025-05-07T20:25:57.0772121Z 2025-05-07T20:25:57.0772126Z 2025-05-07T20:25:57.0772131Z 2025-05-07T20:25:57.0772136Z 2025-05-07T20:25:57.0772148Z 2025-05-07T20:25:57.0772163Z 2025-05-07T20:25:57.0772168Z 2025-05-07T20:25:57.0772173Z 2025-05-07T20:25:57.0772179Z 2025-05-07T20:25:57.0772413Z  2025-05-07T20:25:57.0772683Z 2025-05-07T20:25:57.0772689Z 2025-05-07T20:25:57.0772694Z 2025-05-07T20:25:57.0772707Z 2025-05-07T20:25:57.0772713Z 2025-05-07T20:25:57.0772718Z 2025-05-07T20:25:57.0772723Z 2025-05-07T20:25:57.0772728Z 2025-05-07T20:25:57.0772733Z 2025-05-07T20:25:57.0772739Z 2025-05-07T20:25:57.0772745Z 2025-05-07T20:25:57.0772750Z 2025-05-07T20:25:57.0772756Z 2025-05-07T20:25:57.0772761Z 2025-05-07T20:25:57.0772767Z 2025-05-07T20:25:57.0772976Z  2025-05-07T20:25:57.0773270Z 2025-05-07T20:25:57.0773276Z 2025-05-07T20:25:57.0773281Z 2025-05-07T20:25:57.0773287Z 2025-05-07T20:25:57.0773292Z 2025-05-07T20:25:57.0773297Z 2025-05-07T20:25:57.0773302Z 2025-05-07T20:25:57.0773308Z 2025-05-07T20:25:57.0773313Z 2025-05-07T20:25:57.0773327Z 2025-05-07T20:25:57.0773452Z 2025-05-07T20:25:57.0773457Z 2025-05-07T20:25:57.0773462Z 2025-05-07T20:25:57.0773468Z 2025-05-07T20:25:57.0773473Z 2025-05-07T20:25:57.0773478Z 2025-05-07T20:25:57.0773739Z  2025-05-07T20:25:57.0774050Z 2025-05-07T20:25:57.0774055Z 2025-05-07T20:25:57.0774060Z 2025-05-07T20:25:57.0774065Z 2025-05-07T20:25:57.0774070Z 2025-05-07T20:25:57.0774076Z 2025-05-07T20:25:57.0774081Z 2025-05-07T20:25:57.0774086Z 2025-05-07T20:25:57.0774092Z 2025-05-07T20:25:57.0774097Z 2025-05-07T20:25:57.0774114Z 2025-05-07T20:25:57.0774119Z 2025-05-07T20:25:57.0774124Z 2025-05-07T20:25:57.0774130Z 2025-05-07T20:25:57.0774135Z 2025-05-07T20:25:57.0774140Z 2025-05-07T20:25:57.0774145Z 2025-05-07T20:25:57.0774376Z  2025-05-07T20:25:57.0774678Z 2025-05-07T20:25:57.0774683Z 2025-05-07T20:25:57.0774689Z 2025-05-07T20:25:57.0774695Z 2025-05-07T20:25:57.0774700Z 2025-05-07T20:25:57.0774712Z 2025-05-07T20:25:57.0774725Z 2025-05-07T20:25:57.0774731Z 2025-05-07T20:25:57.0774736Z 2025-05-07T20:25:57.0774740Z 2025-05-07T20:25:57.0774745Z 2025-05-07T20:25:57.0774750Z 2025-05-07T20:25:57.0774755Z 2025-05-07T20:25:57.0774760Z 2025-05-07T20:25:57.0774765Z 2025-05-07T20:25:57.0774770Z 2025-05-07T20:25:57.0774775Z 2025-05-07T20:25:57.0774780Z 2025-05-07T20:25:57.0775017Z  2025-05-07T20:25:57.0775309Z 2025-05-07T20:25:57.0775315Z 2025-05-07T20:25:57.0775456Z  2025-05-07T20:25:57.0775615Z 
2025-05-07T20:25:57.0775621Z 2025-05-07T20:25:57.0775763Z  2025-05-07T20:25:57.0775911Z 2025-05-07T20:25:57.0775924Z 2025-05-07T20:25:57.0775929Z 2025-05-07T20:25:57.0776076Z  2025-05-07T20:25:57.0776229Z 2025-05-07T20:25:57.0776234Z 2025-05-07T20:25:57.0776239Z 2025-05-07T20:25:57.0776244Z 2025-05-07T20:25:57.0776434Z  2025-05-07T20:25:57.0776597Z 2025-05-07T20:25:57.0776603Z 2025-05-07T20:25:57.0776608Z 2025-05-07T20:25:57.0776734Z 2025-05-07T20:25:57.0776740Z 2025-05-07T20:25:57.0776911Z  2025-05-07T20:25:57.0777088Z 2025-05-07T20:25:57.0777093Z 2025-05-07T20:25:57.0777099Z 2025-05-07T20:25:57.0777104Z 2025-05-07T20:25:57.0777109Z 2025-05-07T20:25:57.0777114Z 2025-05-07T20:25:57.0777287Z  2025-05-07T20:25:57.0777466Z 2025-05-07T20:25:57.0777472Z 2025-05-07T20:25:57.0777477Z 2025-05-07T20:25:57.0777482Z 2025-05-07T20:25:57.0777487Z 2025-05-07T20:25:57.0777492Z 2025-05-07T20:25:57.0777497Z 2025-05-07T20:25:57.0777671Z  2025-05-07T20:25:57.0777862Z 2025-05-07T20:25:57.0777867Z 2025-05-07T20:25:57.0777872Z 2025-05-07T20:25:57.0777878Z 2025-05-07T20:25:57.0777883Z 2025-05-07T20:25:57.0777888Z 2025-05-07T20:25:57.0777893Z 2025-05-07T20:25:57.0777898Z 2025-05-07T20:25:57.0778082Z  2025-05-07T20:25:57.0778290Z 2025-05-07T20:25:57.0778296Z 2025-05-07T20:25:57.0778301Z 2025-05-07T20:25:57.0778306Z 2025-05-07T20:25:57.0778322Z 2025-05-07T20:25:57.0778332Z 2025-05-07T20:25:57.0778337Z 2025-05-07T20:25:57.0778342Z 2025-05-07T20:25:57.0778355Z 2025-05-07T20:25:57.0778529Z  2025-05-07T20:25:57.0778745Z 2025-05-07T20:25:57.0778750Z 2025-05-07T20:25:57.0778756Z 2025-05-07T20:25:57.0778761Z 2025-05-07T20:25:57.0778766Z 2025-05-07T20:25:57.0778771Z 2025-05-07T20:25:57.0778784Z 2025-05-07T20:25:57.0778789Z 2025-05-07T20:25:57.0778794Z 2025-05-07T20:25:57.0778800Z 2025-05-07T20:25:57.0778973Z  2025-05-07T20:25:57.0779199Z 2025-05-07T20:25:57.0779204Z 2025-05-07T20:25:57.0779209Z 2025-05-07T20:25:57.0779215Z 2025-05-07T20:25:57.0779227Z 2025-05-07T20:25:57.0779232Z 2025-05-07T20:25:57.0779237Z 2025-05-07T20:25:57.0779243Z 2025-05-07T20:25:57.0779248Z 2025-05-07T20:25:57.0779253Z 2025-05-07T20:25:57.0779258Z 2025-05-07T20:25:57.0779436Z  2025-05-07T20:25:57.0779693Z 2025-05-07T20:25:57.0779699Z 2025-05-07T20:25:57.0779710Z 2025-05-07T20:25:57.0779824Z 2025-05-07T20:25:57.0779829Z 2025-05-07T20:25:57.0779835Z 2025-05-07T20:25:57.0779840Z 2025-05-07T20:25:57.0779845Z 2025-05-07T20:25:57.0779850Z 2025-05-07T20:25:57.0779856Z 2025-05-07T20:25:57.0779861Z 2025-05-07T20:25:57.0779866Z 2025-05-07T20:25:57.0780061Z  2025-05-07T20:25:57.0780323Z 2025-05-07T20:25:57.0780328Z 2025-05-07T20:25:57.0780333Z 2025-05-07T20:25:57.0780339Z 2025-05-07T20:25:57.0780344Z 2025-05-07T20:25:57.0780349Z 2025-05-07T20:25:57.0780354Z 2025-05-07T20:25:57.0780359Z 2025-05-07T20:25:57.0780364Z 2025-05-07T20:25:57.0780369Z 2025-05-07T20:25:57.0780375Z 2025-05-07T20:25:57.0780380Z 2025-05-07T20:25:57.0780385Z 2025-05-07T20:25:57.0780611Z  2025-05-07T20:25:57.0780875Z 2025-05-07T20:25:57.0780879Z 2025-05-07T20:25:57.0780884Z 2025-05-07T20:25:57.0780890Z 2025-05-07T20:25:57.0780895Z 2025-05-07T20:25:57.0780900Z 2025-05-07T20:25:57.0780905Z 2025-05-07T20:25:57.0780925Z 2025-05-07T20:25:57.0780937Z 2025-05-07T20:25:57.0780942Z 2025-05-07T20:25:57.0780948Z 2025-05-07T20:25:57.0780953Z 2025-05-07T20:25:57.0780958Z 2025-05-07T20:25:57.0780963Z 2025-05-07T20:25:57.0781290Z  2025-05-07T20:25:57.0781574Z 2025-05-07T20:25:57.0781580Z 2025-05-07T20:25:57.0781585Z 2025-05-07T20:25:57.0781590Z 2025-05-07T20:25:57.0781596Z 2025-05-07T20:25:57.0781601Z 
2025-05-07T20:25:57.0781606Z 2025-05-07T20:25:57.0781611Z 2025-05-07T20:25:57.0781616Z 2025-05-07T20:25:57.0781621Z 2025-05-07T20:25:57.0781627Z 2025-05-07T20:25:57.0781632Z 2025-05-07T20:25:57.0781637Z 2025-05-07T20:25:57.0781642Z 2025-05-07T20:25:57.0781647Z 2025-05-07T20:25:57.0781859Z  2025-05-07T20:25:57.0782147Z 2025-05-07T20:25:57.0782152Z 2025-05-07T20:25:57.0782157Z 2025-05-07T20:25:57.0782162Z 2025-05-07T20:25:57.0782167Z 2025-05-07T20:25:57.0782173Z 2025-05-07T20:25:57.0782178Z 2025-05-07T20:25:57.0782300Z 2025-05-07T20:25:57.0782314Z 2025-05-07T20:25:57.0782320Z 2025-05-07T20:25:57.0782325Z 2025-05-07T20:25:57.0782330Z 2025-05-07T20:25:57.0782336Z 2025-05-07T20:25:57.0782341Z 2025-05-07T20:25:57.0782346Z 2025-05-07T20:25:57.0782351Z 2025-05-07T20:25:57.0782580Z  2025-05-07T20:25:57.0782868Z 2025-05-07T20:25:57.0782873Z 2025-05-07T20:25:57.0782878Z 2025-05-07T20:25:57.0782884Z 2025-05-07T20:25:57.0782889Z 2025-05-07T20:25:57.0782894Z 2025-05-07T20:25:57.0782899Z 2025-05-07T20:25:57.0782911Z 2025-05-07T20:25:57.0782917Z 2025-05-07T20:25:57.0782922Z 2025-05-07T20:25:57.0782927Z 2025-05-07T20:25:57.0782933Z 2025-05-07T20:25:57.0782937Z 2025-05-07T20:25:57.0782943Z 2025-05-07T20:25:57.0782948Z 2025-05-07T20:25:57.0782953Z 2025-05-07T20:25:57.0782958Z 2025-05-07T20:25:57.0783177Z  2025-05-07T20:25:57.0783474Z 2025-05-07T20:25:57.0783480Z 2025-05-07T20:25:57.0783485Z 2025-05-07T20:25:57.0783496Z 2025-05-07T20:25:57.0783506Z 2025-05-07T20:25:57.0783511Z 2025-05-07T20:25:57.0783516Z 2025-05-07T20:25:57.0783521Z 2025-05-07T20:25:57.0783526Z 2025-05-07T20:25:57.0783531Z 2025-05-07T20:25:57.0783537Z 2025-05-07T20:25:57.0783542Z 2025-05-07T20:25:57.0783549Z 2025-05-07T20:25:57.0783555Z 2025-05-07T20:25:57.0783562Z 2025-05-07T20:25:57.0783569Z 2025-05-07T20:25:57.0783575Z 2025-05-07T20:25:57.0783582Z 2025-05-07T20:25:57.0783856Z  2025-05-07T20:25:57.0784153Z 2025-05-07T20:25:57.0784158Z 2025-05-07T20:25:57.0784295Z  2025-05-07T20:25:57.0784444Z 2025-05-07T20:25:57.0784449Z 2025-05-07T20:25:57.0784588Z  2025-05-07T20:25:57.0784744Z 2025-05-07T20:25:57.0784749Z 2025-05-07T20:25:57.0784754Z 2025-05-07T20:25:57.0784902Z  2025-05-07T20:25:57.0785053Z 2025-05-07T20:25:57.0785059Z 2025-05-07T20:25:57.0785064Z 2025-05-07T20:25:57.0785084Z 2025-05-07T20:25:57.0785268Z  2025-05-07T20:25:57.0785446Z 2025-05-07T20:25:57.0785554Z 2025-05-07T20:25:57.0785559Z 2025-05-07T20:25:57.0785564Z 2025-05-07T20:25:57.0785570Z 2025-05-07T20:25:57.0785725Z  2025-05-07T20:25:57.0785906Z 2025-05-07T20:25:57.0785911Z 2025-05-07T20:25:57.0785916Z 2025-05-07T20:25:57.0785921Z 2025-05-07T20:25:57.0785927Z 2025-05-07T20:25:57.0785932Z 2025-05-07T20:25:57.0786087Z  2025-05-07T20:25:57.0786269Z 2025-05-07T20:25:57.0786274Z 2025-05-07T20:25:57.0786279Z 2025-05-07T20:25:57.0786284Z 2025-05-07T20:25:57.0786289Z 2025-05-07T20:25:57.0786294Z 2025-05-07T20:25:57.0786300Z 2025-05-07T20:25:57.0786463Z  2025-05-07T20:25:57.0786663Z 2025-05-07T20:25:57.0786669Z 2025-05-07T20:25:57.0786674Z 2025-05-07T20:25:57.0786679Z 2025-05-07T20:25:57.0786684Z 2025-05-07T20:25:57.0786690Z 2025-05-07T20:25:57.0786695Z 2025-05-07T20:25:57.0786700Z 2025-05-07T20:25:57.0786864Z  2025-05-07T20:25:57.0787087Z 2025-05-07T20:25:57.0787092Z 2025-05-07T20:25:57.0787101Z 2025-05-07T20:25:57.0787115Z 2025-05-07T20:25:57.0787120Z 2025-05-07T20:25:57.0787125Z 2025-05-07T20:25:57.0787130Z 2025-05-07T20:25:57.0787135Z 2025-05-07T20:25:57.0787141Z 2025-05-07T20:25:57.0787313Z  2025-05-07T20:25:57.0787538Z 2025-05-07T20:25:57.0787544Z 2025-05-07T20:25:57.0787549Z 
2025-05-07T20:25:57.0787554Z 2025-05-07T20:25:57.0787559Z 2025-05-07T20:25:57.0787564Z 2025-05-07T20:25:57.0787569Z 2025-05-07T20:25:57.0787574Z 2025-05-07T20:25:57.0787580Z 2025-05-07T20:25:57.0787585Z 2025-05-07T20:25:57.0787758Z  2025-05-07T20:25:57.0787993Z 2025-05-07T20:25:57.0787998Z 2025-05-07T20:25:57.0788003Z 2025-05-07T20:25:57.0788008Z 2025-05-07T20:25:57.0788013Z 2025-05-07T20:25:57.0788019Z 2025-05-07T20:25:57.0788024Z 2025-05-07T20:25:57.0788029Z 2025-05-07T20:25:57.0788034Z 2025-05-07T20:25:57.0788040Z 2025-05-07T20:25:57.0788045Z 2025-05-07T20:25:57.0788236Z  2025-05-07T20:25:57.0788608Z 2025-05-07T20:25:57.0788621Z 2025-05-07T20:25:57.0788626Z 2025-05-07T20:25:57.0788631Z 2025-05-07T20:25:57.0788637Z 2025-05-07T20:25:57.0788642Z 2025-05-07T20:25:57.0788647Z 2025-05-07T20:25:57.0788652Z 2025-05-07T20:25:57.0788657Z 2025-05-07T20:25:57.0788662Z 2025-05-07T20:25:57.0788667Z 2025-05-07T20:25:57.0788673Z 2025-05-07T20:25:57.0788886Z  2025-05-07T20:25:57.0789133Z 2025-05-07T20:25:57.0789139Z 2025-05-07T20:25:57.0789144Z 2025-05-07T20:25:57.0789149Z 2025-05-07T20:25:57.0789154Z 2025-05-07T20:25:57.0789160Z 2025-05-07T20:25:57.0789164Z 2025-05-07T20:25:57.0789170Z 2025-05-07T20:25:57.0789175Z 2025-05-07T20:25:57.0789180Z 2025-05-07T20:25:57.0789185Z 2025-05-07T20:25:57.0789200Z 2025-05-07T20:25:57.0789205Z 2025-05-07T20:25:57.0789396Z  2025-05-07T20:25:57.0789656Z 2025-05-07T20:25:57.0789661Z 2025-05-07T20:25:57.0789666Z 2025-05-07T20:25:57.0789671Z 2025-05-07T20:25:57.0789676Z 2025-05-07T20:25:57.0789700Z 2025-05-07T20:25:57.0789705Z 2025-05-07T20:25:57.0789710Z 2025-05-07T20:25:57.0789716Z 2025-05-07T20:25:57.0789721Z 2025-05-07T20:25:57.0789726Z 2025-05-07T20:25:57.0789731Z 2025-05-07T20:25:57.0789737Z 2025-05-07T20:25:57.0789742Z 2025-05-07T20:25:57.0789940Z  2025-05-07T20:25:57.0790143Z 2025-05-07T20:25:57.0790147Z 2025-05-07T20:25:57.0790151Z 2025-05-07T20:25:57.0790154Z 2025-05-07T20:25:57.0790158Z 2025-05-07T20:25:57.0790162Z 2025-05-07T20:25:57.0790165Z 2025-05-07T20:25:57.0790169Z 2025-05-07T20:25:57.0790172Z 2025-05-07T20:25:57.0790176Z 2025-05-07T20:25:57.0790180Z 2025-05-07T20:25:57.0790183Z 2025-05-07T20:25:57.0790187Z 2025-05-07T20:25:57.0790190Z 2025-05-07T20:25:57.0790194Z 2025-05-07T20:25:57.0790347Z  2025-05-07T20:25:57.0790547Z 2025-05-07T20:25:57.0790551Z 2025-05-07T20:25:57.0790554Z 2025-05-07T20:25:57.0790558Z 2025-05-07T20:25:57.0790562Z 2025-05-07T20:25:57.0790570Z 2025-05-07T20:25:57.0790663Z 2025-05-07T20:25:57.0790667Z 2025-05-07T20:25:57.0790670Z 2025-05-07T20:25:57.0790674Z 2025-05-07T20:25:57.0790677Z 2025-05-07T20:25:57.0790681Z 2025-05-07T20:25:57.0790685Z 2025-05-07T20:25:57.0790688Z 2025-05-07T20:25:57.0790699Z 2025-05-07T20:25:57.0790703Z 2025-05-07T20:25:57.0790860Z  2025-05-07T20:25:57.0791067Z 2025-05-07T20:25:57.0791071Z 2025-05-07T20:25:57.0791075Z 2025-05-07T20:25:57.0791078Z 2025-05-07T20:25:57.0791082Z 2025-05-07T20:25:57.0791092Z 2025-05-07T20:25:57.0791096Z 2025-05-07T20:25:57.0791099Z 2025-05-07T20:25:57.0791103Z 2025-05-07T20:25:57.0791106Z 2025-05-07T20:25:57.0791110Z 2025-05-07T20:25:57.0791114Z 2025-05-07T20:25:57.0791117Z 2025-05-07T20:25:57.0791121Z 2025-05-07T20:25:57.0791124Z 2025-05-07T20:25:57.0791128Z 2025-05-07T20:25:57.0791132Z 2025-05-07T20:25:57.0791315Z  2025-05-07T20:25:57.0791522Z 2025-05-07T20:25:57.0791534Z 2025-05-07T20:25:57.0791537Z 2025-05-07T20:25:57.0791541Z 2025-05-07T20:25:57.0791544Z 2025-05-07T20:25:57.0791548Z 2025-05-07T20:25:57.0791552Z 2025-05-07T20:25:57.0791555Z 2025-05-07T20:25:57.0791559Z 
2025-05-07T20:25:57.0795773Z done 2025-05-07T20:25:57.3985879Z Preparing transaction: done 2025-05-07T20:25:58.8331624Z Verifying transaction: done 2025-05-07T20:25:59.6589967Z Executing transaction: done 2025-05-07T20:26:02.0140503Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:02.0141070Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:02.0141966Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:02.0142553Z 2025-05-07T20:26:02.0156002Z 2025-05-07T20:26:02.0157112Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:02.0157926Z 2025-05-07T20:26:02.0170272Z 2025-05-07T20:26:02.0170677Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:02.0175794Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:02.0179605Z 2025-05-07T20:26:02.0384494Z 2025-05-07T20:26:02.0390019Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:02.0393776Z 2025-05-07T20:26:02.0411509Z 2025-05-07T20:26:02.0411932Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:02.0791130Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:03.9950362Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error) 2025-05-07T20:26:04.0624070Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs 2025-05-07T20:26:04.0624690Z 2025-05-07T20:26:04.4854073Z 2025-05-07T20:26:04.4862500Z [INSTALL] Setting environment variable NVML_LIB_PATH ... 
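The [INSTALL] fix-up above can be reproduced outside CI. A minimal sketch, assuming CONDA_PREFIX points at the build env; the loop and the nvtx3 variable are illustrative, not from the log:

    # Recreate the unversioned libnvToolsExt.so symlinks that CUDA 12.6+ packages
    # no longer ship, in both lib directories the toolchain searches.
    for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
      [ -f "$libdir/libnvToolsExt.so.1" ] && ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    done
    # nsight-compute bundles the nvtx3 headers; copy them next to the standard
    # CUDA headers so legacy '#include <nvToolsExt.h>' still resolves.
    nvtx3=$(ls -d "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3 | head -n1)
    cp -r "$nvtx3"/* "$CONDA_PREFIX/include/"
    cp -r "$nvtx3"/* "$CONDA_PREFIX/targets/x86_64-linux/include/"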
2025-05-07T20:26:04.5216131Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so 2025-05-07T20:26:04.5216637Z 2025-05-07T20:26:04.9636307Z 2025-05-07T20:26:04.9636633Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ... 2025-05-07T20:26:04.9637554Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/" 2025-05-07T20:26:04.9638269Z 2025-05-07T20:26:05.3891476Z 2025-05-07T20:26:07.4178656Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h 2025-05-07T20:26:09.4456852Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so 2025-05-07T20:26:11.4686588Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:11.4687411Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:13.4993050Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T20:26:15.3885092Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc 2025-05-07T20:26:15.3885383Z 2025-05-07T20:26:15.4521315Z [CHECK] Binary nvcc found in PATH 2025-05-07T20:26:19.3265838Z /tmp/tmp2em5kp8a: line 3: clang: command not found 2025-05-07T20:26:19.3266203Z 2025-05-07T20:26:19.3266953Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error) 2025-05-07T20:26:19.3913614Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d 2025-05-07T20:26:19.3914133Z 2025-05-07T20:26:19.3934940Z total 36 2025-05-07T20:26:19.3935250Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 . 2025-05-07T20:26:19.3935642Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 .. 2025-05-07T20:26:19.3936089Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh 2025-05-07T20:26:19.3936608Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh 2025-05-07T20:26:19.3937254Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh 2025-05-07T20:26:19.3937855Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh 2025-05-07T20:26:19.3938359Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh 2025-05-07T20:26:19.3938820Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh 2025-05-07T20:26:19.3939109Z 2025-05-07T20:26:19.3939334Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ... 2025-05-07T20:26:19.3939974Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:26:19.3940638Z 2025-05-07T20:26:19.3960467Z 2025-05-07T20:26:19.3960781Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:26:19.3961122Z 2025-05-07T20:26:21.3570600Z 2025-05-07T20:26:21.3571271Z [BUILD] Setting prepend flags for NVCC ... 
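The two host-compiler tweaks around this point work together: the sed above strips the -ccbin= line (which pins nvcc to the conda-provided $CXX) from the cuda-nvcc activation hook, and the records that follow persist NVCC_PREPEND_FLAGS so nvcc tolerates a host compiler it does not officially support. A minimal sketch of the same pair of steps, assuming the env is named build_binary as in the log:

    # Drop the -ccbin= pin from the nvcc activation hook (filename as listed above).
    sed -i '/-ccbin=/d' "$CONDA_PREFIX"/etc/conda/activate.d/*cuda-nvcc_activate.sh
    # Persist the flag in the env; 'conda env config vars set' stores it in the
    # env's state so it is re-exported on every 'conda activate' / 'conda run'.
    conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
    # Round-trip check. Note that printenv exits non-zero when the variable is
    # unset, which is likely why the earlier LD_LIBRARY_PATH probe logged a
    # 'conda run ... failed' error before the variable was first set.
    conda run -n build_binary printenv NVCC_PREPEND_FLAGS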
2025-05-07T20:26:21.3571843Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:26:21.3572226Z 2025-05-07T20:26:21.7909547Z 2025-05-07T20:26:21.7909953Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:26:21.7910328Z 2025-05-07T20:26:23.6979050Z -allow-unsupported-compiler 2025-05-07T20:26:23.6979317Z 2025-05-07T20:26:23.7637898Z 2025-05-07T20:26:23.7638708Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:26:23.7639736Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:23.7640742Z 2025-05-07T20:26:25.7282746Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:25.7283854Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:25.7284198Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:25.7284523Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:25.7284858Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:25.7285136Z #define _STL_PAIR_H 1 2025-05-07T20:26:25.7285388Z #define __cpp_attributes 200809L 2025-05-07T20:26:25.7285726Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:25.7286087Z #define __DELETE_THROW throw() 2025-05-07T20:26:25.7286351Z #define _PTRDIFF_T_ 2025-05-07T20:26:25.7286611Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:25.7286907Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:25.7287178Z #define _IO_LEFT 02 2025-05-07T20:26:25.7287415Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:25.7287681Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:25.7287956Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:25.7288406Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:25.7288844Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:25.7289132Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:25.7289383Z #define _IOS_OUTPUT 2 2025-05-07T20:26:25.7289687Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:25.7290061Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:25.7290369Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:25.7290654Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:25.7290954Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:25.7291737Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:25.7292778Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:25.7293096Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:25.7293571Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:25.7293897Z #define _T_WCHAR_ 2025-05-07T20:26:25.7294123Z #define stdout stdout 2025-05-07T20:26:25.7294465Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:25.7294850Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:25.7295107Z #define __flexarr [] 2025-05-07T20:26:25.7295359Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:25.7295694Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:25.7296044Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:25.7296310Z #define _MATH_H 1 2025-05-07T20:26:25.7296649Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:25.7297143Z #define __S64_TYPE long int 2025-05-07T20:26:25.7297506Z #define 
__stub_fchflags 2025-05-07T20:26:25.7297881Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:25.7298277Z #define __SQUAD_TYPE long int 2025-05-07T20:26:25.7298652Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:25.7299033Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:25.7299405Z #define NL_NMAX INT_MAX 2025-05-07T20:26:25.7299731Z #define _BITS_TIME_H 1 2025-05-07T20:26:25.7300127Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:25.7300522Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:25.7300829Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:25.7301289Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:25.7301698Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:25.7302069Z #define __CHAR_BIT__ 8 2025-05-07T20:26:25.7302338Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7302666Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:25.7302962Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:25.7303237Z #define FP_NAN 0 2025-05-07T20:26:25.7303505Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:25.7303954Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:25.7304571Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:25.7304964Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:25.7305258Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:25.7305525Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:25.7305786Z #define __SM_80_RT_H__ 2025-05-07T20:26:25.7306022Z #define _NEW 2025-05-07T20:26:25.7306249Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:25.7306541Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:25.7306919Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:25.7307321Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:25.7307574Z #define __USE_ANSI 1 2025-05-07T20:26:25.7307869Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:25.7308271Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:25.7308630Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:25.7308937Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:25.7309231Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:25.7309523Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:25.7309813Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:25.7310107Z #define PIPE_BUF 4096 2025-05-07T20:26:25.7310434Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:25.7310809Z #define ADJ_TICK 0x4000 2025-05-07T20:26:25.7311094Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:25.7311417Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:25.7311693Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:25.7312022Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:25.7312489Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.7313017Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:25.7313390Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:25.7313652Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:25.7314014Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7314316Z #define __cpp_static_assert 201411L 2025-05-07T20:26:25.7314662Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:25.7315058Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:25.7315345Z #define _POSIX_TTY_NAME_MAX 
9 2025-05-07T20:26:25.7315636Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:25.7315947Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:25.7316240Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:25.7316548Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7316914Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:25.7317259Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:25.7317549Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:25.7317871Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7318238Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:25.7318599Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:25.7318908Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:25.7319210Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:25.7319543Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:25.7319876Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:25.7320284Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:25.7320708Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:25.7321025Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:25.7321313Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:25.7321598Z #define __GCC_IEC_559 2 2025-05-07T20:26:25.7321901Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:25.7322251Z #define _IO_flockfile(_fp) 2025-05-07T20:26:25.7322518Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:25.7322798Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:25.7323075Z #define _IOFBF 0 2025-05-07T20:26:25.7323293Z #define __USE_BSD 1 2025-05-07T20:26:25.7323532Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:25.7323913Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:25.7324194Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:25.7324458Z #define _IO_NO_WRITES 8 2025-05-07T20:26:25.7324725Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:25.7325082Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:25.7325447Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:25.7325772Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:25.7326109Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:25.7326406Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:25.7326689Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:25.7326966Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:25.7327285Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:25.7327677Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:25.7328050Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:25.7328359Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:25.7328688Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:25.7329025Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:25.7329331Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:25.7329645Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:25.7329927Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:25.7330201Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:25.7330782Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:25.7331370Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:25.7331703Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:25.7332029Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:25.7332340Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:25.7332623Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:25.7332891Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:25.7333293Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:25.7333637Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:25.7333948Z #define RAND_MAX 2147483647 2025-05-07T20:26:25.7334215Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:25.7334551Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7334873Z #define __SM_90_RT_H__ 2025-05-07T20:26:25.7335119Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:25.7335389Z #define __COMPAR_FN_T 2025-05-07T20:26:25.7335642Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7335910Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:25.7336396Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:25.7336912Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.7337260Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.7337633Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:25.7337966Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:25.7349614Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:25.7349960Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:25.7350490Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:25.7351037Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:25.7351380Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:25.7351664Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:25.7351970Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:25.7352282Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:25.7352572Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:25.7352892Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:25.7353156Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:25.7353413Z #define __u_char_defined 2025-05-07T20:26:25.7353732Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:25.7354088Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:25.7354355Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:25.7354621Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:25.7355102Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:25.7355544Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:25.7355971Z #define FP_INFINITE 1 2025-05-07T20:26:25.7356343Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.7356764Z #define _IO_pid_t __pid_t 2025-05-07T20:26:25.7357025Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:25.7357292Z #define __LEAF , __leaf__ 2025-05-07T20:26:25.7357531Z #define PATH_MAX 4096 2025-05-07T20:26:25.7357801Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:25.7358136Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:25.7358461Z #define _LIMITS_H___ 2025-05-07T20:26:25.7358693Z #define __size_t 2025-05-07T20:26:25.7358924Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:25.7359474Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:25.7360041Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:25.7360351Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:25.7360676Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:25.7360942Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:25.7361302Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:25.7361695Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:25.7361997Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:25.7362327Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:25.7362609Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:25.7362894Z #define __INT8_C(c) c 2025-05-07T20:26:25.7363159Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:25.7363460Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:25.7363720Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:25.7363986Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:25.7364237Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:25.7364659Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:25.7364993Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7365322Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:25.7365591Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:25.7365871Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:25.7366140Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:25.7366453Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:25.7366760Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:25.7367126Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:25.7367499Z #define NFDBITS __NFDBITS 2025-05-07T20:26:25.7367764Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:25.7368056Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:25.7368377Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:25.7368693Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:25.7368955Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:25.7369253Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:25.7369561Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:25.7369877Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:25.7370295Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:25.7370652Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:25.7370951Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:25.7371273Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:25.7371640Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:25.7371985Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:25.7372309Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:25.7372646Z #define __daddr_t_defined 2025-05-07T20:26:25.7372895Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:25.7373173Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:25.7373494Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:25.7374009Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:25.7374580Z #define _ACRTIMP 2025-05-07T20:26:25.7374807Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:25.7375072Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:25.7375368Z #define _IOS_BIN 128 2025-05-07T20:26:25.7375724Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:25.7376139Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:25.7376405Z #define UNDERFLOW 4 2025-05-07T20:26:25.7376628Z #define NAME_MAX 255 2025-05-07T20:26:25.7376867Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:25.7377132Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:25.7377414Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:25.7377710Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:25.7378080Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:25.7378469Z #define __ptr_t void * 2025-05-07T20:26:25.7378715Z #define M_E 2.7182818284590452354 2025-05-07T20:26:25.7379001Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:25.7379274Z #define __USE_ISOCXX11 1 2025-05-07T20:26:25.7379546Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:25.7379859Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:25.7380158Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:25.7380444Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:25.7380737Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:25.7381049Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:25.7381369Z #define __linux 1 2025-05-07T20:26:25.7381600Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:25.7381871Z #define cudaDeviceMask 0xff 2025-05-07T20:26:25.7382142Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:25.7382441Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:25.7382744Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:25.7383064Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:25.7383460Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:25.7383769Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:25.7384064Z #define _BITS_TYPES_H 1 2025-05-07T20:26:25.7384357Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:25.7384691Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:25.7385002Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:25.7385284Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:25.7385576Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:25.7385861Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:25.7386640Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:25.7387453Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:25.7387733Z #define __unix 1 2025-05-07T20:26:25.7387952Z #define MATH_ERRNO 1 2025-05-07T20:26:25.7388198Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:25.7388479Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:25.7388753Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:25.7389042Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:25.7389332Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7389616Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:25.7390084Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:25.7390548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:25.7390840Z #define CUDARTAPI_CDECL 2025-05-07T20:26:25.7391100Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:25.7391377Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:25.7391660Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:25.7391929Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:25.7392169Z #define __SIZE_T 2025-05-07T20:26:25.7392416Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:25.7392738Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:25.7393041Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:25.7393383Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:25.7393648Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:25.7394037Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:25.7394465Z #define __WAIT_STATUS void * 2025-05-07T20:26:25.7394724Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:25.7394992Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:25.7395263Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:25.7395545Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:25.7395823Z #define __WINT_MIN__ 0U 2025-05-07T20:26:25.7396399Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:25.7397042Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:25.7397343Z #define WUNTRACED 2 2025-05-07T20:26:25.7397575Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:25.7397860Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:25.7398146Z #define NZERO 20 2025-05-07T20:26:25.7398378Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:25.7398663Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:25.7398951Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:25.7399239Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:25.7399501Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:25.7399780Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:25.7400060Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:25.7400340Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:25.7400611Z #define EXIT_FAILURE 1 2025-05-07T20:26:25.7400853Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:25.7401120Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:25.7401383Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:25.7401640Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:25.7401924Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:25.7402259Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:25.7402703Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:25.7403009Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:25.7403264Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:25.7403532Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:25.7403828Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:25.7404139Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:25.7404424Z #define SEEK_DATA 3 2025-05-07T20:26:25.7404658Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:25.7404955Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:25.7405370Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:25.7405760Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:25.7406011Z #define __INT64_C(c) c ## L 2025-05-07T20:26:25.7406282Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:25.7406613Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:25.7406945Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:25.7407229Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:25.7407533Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:25.7407832Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:25.7408094Z #define __INT_WCHAR_T_H 2025-05-07T20:26:25.7408338Z #define WSTOPPED 2 2025-05-07T20:26:25.7408572Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:25.7408861Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:25.7409119Z #define FP_NORMAL 4 
2025-05-07T20:26:25.7409361Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:25.7409656Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:25.7409898Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:25.7410155Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:25.7410450Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:25.7410727Z #define cudaTextureType1D 0x01 2025-05-07T20:26:25.7411000Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:25.7411266Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:25.7411536Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:25.7411834Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:25.7412346Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:25.7412824Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:25.7413117Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:25.7413378Z #define _POSIX_SOURCE 1 2025-05-07T20:26:25.7413630Z #define cudaTextureType2D 0x02 2025-05-07T20:26:25.7413900Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:25.7414170Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:25.7414491Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:25.7414762Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:25.7415080Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:25.7415417Z #define cudaTextureType3D 0x03 2025-05-07T20:26:25.7415691Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:25.7415948Z #define CLOCK_REALTIME 0 2025-05-07T20:26:25.7416205Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:25.7416484Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:25.7416792Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:25.7417078Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:25.7417359Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:25.7417651Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:25.7417919Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:25.7418226Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:25.7418523Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:25.7418800Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:25.7419060Z #define __GLIBC__ 2 2025-05-07T20:26:25.7419282Z #define __END_DECLS } 2025-05-07T20:26:25.7419517Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:25.7419889Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:25.7420269Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:25.7420518Z #define WCONTINUED 8 2025-05-07T20:26:25.7420756Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:25.7421016Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:25.7421343Z #define _ALLOCA_H 1 2025-05-07T20:26:25.7421659Z #define __host__ __location__(host) 2025-05-07T20:26:25.7422089Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.7422523Z #define __SLONG32_TYPE int 2025-05-07T20:26:25.7422811Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:25.7423125Z #define _SYS_SELECT_H 1 2025-05-07T20:26:25.7423369Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:25.7423616Z #define _IOS_NOCREATE 32 2025-05-07T20:26:25.7423870Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:25.7424156Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:25.7424447Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:25.7424737Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:25.7425025Z #define __global__ __location__(global) 2025-05-07T20:26:25.7425315Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:25.7425576Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:25.7425853Z #define __DBL_DIG__ 15 2025-05-07T20:26:25.7426094Z #define TIME_UTC 1 2025-05-07T20:26:25.7426320Z #define __FLT32_DIG__ 6 2025-05-07T20:26:25.7426651Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:25.7427049Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:25.7427362Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:25.7427677Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:25.7427976Z #define _G_BUFSIZ 8192 2025-05-07T20:26:25.7428277Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:25.7428649Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:25.7428951Z #define __cudaCDP2GetDevice 2025-05-07T20:26:25.7429229Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:25.7429522Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:25.7429774Z #define __GXX_WEAK__ 1 2025-05-07T20:26:25.7430024Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7430334Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:25.7430599Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:25.7430900Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:25.7431347Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:25.7431631Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:25.7431920Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:25.7432215Z #define _G_config_h 1 2025-05-07T20:26:25.7432496Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:25.7432836Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:25.7433110Z #define _GCC_WCHAR_T 2025-05-07T20:26:25.7433344Z #define TMP_MAX 238328 2025-05-07T20:26:25.7433588Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:25.7433850Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:25.7434114Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.7434397Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:25.7434680Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:25.7434960Z #define _IO_SKIPWS 01 2025-05-07T20:26:25.7435366Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:25.7435833Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:25.7436102Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:25.7436442Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:25.7436811Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:25.7437176Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:25.7437543Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:25.7437802Z #define le32toh(x) (x) 2025-05-07T20:26:25.7438033Z #define _SIZE_T_DEFINED 2025-05-07T20:26:25.7438292Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:25.7438636Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:25.7438996Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:25.7439392Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:25.7439806Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:25.7440392Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:25.7440707Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:25.7441107Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:25.7441399Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:25.7441929Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:25.7442430Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:25.7442743Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:25.7443089Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:25.7443407Z #define _WCHAR_T_ 2025-05-07T20:26:25.7443639Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:25.7444002Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:25.7444382Z #define RTSIG_MAX 32 2025-05-07T20:26:25.7444610Z #define _STDDEF_H 2025-05-07T20:26:25.7444845Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:25.7445115Z #define _VA_LIST_DEFINED 2025-05-07T20:26:25.7445374Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:25.7445717Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:25.7446108Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:25.7446444Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:25.7446739Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:25.7447204Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:25.7447737Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:25.7448111Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:25.7448435Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:25.7448745Z #define __unix__ 1 2025-05-07T20:26:25.7448988Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7449276Z #define __INT_WIDTH__ 32 2025-05-07T20:26:25.7449521Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:25.7449762Z #define _IONBF 2 2025-05-07T20:26:25.7450208Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:25.7450966Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:25.7451637Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:25.7451904Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:25.7452177Z #define __UINT16_C(c) c 2025-05-07T20:26:25.7452417Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:25.7452735Z #define STA_DEL 0x0020 2025-05-07T20:26:25.7452996Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:25.7453252Z #define __id_t_defined 2025-05-07T20:26:25.7453531Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:25.7453984Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:25.7454415Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:25.7454691Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:25.7454958Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:25.7455209Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:25.7455483Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:25.7455765Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:25.7456039Z #define SING 2 2025-05-07T20:26:25.7456259Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:25.7456531Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7456836Z #define cudaStreamDefault 0x00 2025-05-07T20:26:25.7457180Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:25.7457552Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:25.7457829Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:25.7458096Z #define __gnu_linux__ 1 2025-05-07T20:26:25.7458337Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:25.7458598Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:25.7458842Z #define MAX_INPUT 255 2025-05-07T20:26:25.7459086Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:25.7459419Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:25.7459788Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:25.7460188Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:25.7460519Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:25.7460928Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:25.7461409Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:25.7461745Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:25.7462105Z #define _Mfloat_ float 2025-05-07T20:26:25.7462364Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:25.7462678Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:25.7462968Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:25.7463454Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:25.7463947Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.7464226Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:25.7464558Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:25.7464911Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:25.7465222Z #define __USE_ISOC11 1 2025-05-07T20:26:25.7465462Z #define _BSD_SIZE_T_ 2025-05-07T20:26:25.7465692Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:25.7465945Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:25.7466214Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:25.7466509Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:25.7466840Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:25.7467158Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:25.7467486Z #define __THROW throw () 2025-05-07T20:26:25.7467741Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:25.7468033Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7468394Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.7468746Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:25.7469024Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:25.7475995Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:25.7476297Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:25.7476577Z #define L_tmpnam 20 2025-05-07T20:26:25.7476912Z #define ___int_wchar_t_h 2025-05-07T20:26:25.7477257Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:25.7477644Z #define isascii(c) __isascii (c) 2025-05-07T20:26:25.7477908Z #define _T_PTRDIFF 2025-05-07T20:26:25.7478216Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:25.7478574Z #define toascii(c) __toascii (c) 2025-05-07T20:26:25.7478837Z #define __GNUC__ 11 2025-05-07T20:26:25.7479085Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:25.7479393Z #define __GXX_RTTI 1 2025-05-07T20:26:25.7479620Z #define __pie__ 2 2025-05-07T20:26:25.7479830Z #define __MMX__ 1 2025-05-07T20:26:25.7480051Z #define __cudaCDP2Malloc 2025-05-07T20:26:25.7480309Z #define __timespec_defined 1 2025-05-07T20:26:25.7480557Z #define L_ctermid 9 2025-05-07T20:26:25.7480792Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:25.7481099Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:25.7481501Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:25.7481883Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:25.7482153Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:25.7482446Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:25.7482796Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:25.7483113Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:25.7483375Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:25.7483812Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:25.7484559Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:25.7485158Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:25.7485461Z #define __USE_SVID 1 2025-05-07T20:26:25.7485710Z #define __constant__ __location__(constant) 2025-05-07T20:26:25.7486016Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:25.7486402Z #define __device__ __location__(device) 2025-05-07T20:26:25.7486730Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:25.7487050Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:25.7487316Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:25.7487596Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:25.7487944Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:25.7488313Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:25.7488595Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:25.7488960Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:25.7489333Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:25.7489579Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:25.7489947Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:25.7490362Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:25.7490677Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:25.7490953Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:25.7491219Z #define NGROUPS_MAX 65536 2025-05-07T20:26:25.7491471Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:25.7491733Z #define __USE_ISOC95 1 2025-05-07T20:26:25.7491952Z #define _TIME_H 1 2025-05-07T20:26:25.7492218Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:25.7492537Z #define __USE_ISOC99 1 2025-05-07T20:26:25.7492857Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:25.7493216Z #define HOST_NAME_MAX 64 2025-05-07T20:26:25.7493462Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:25.7493722Z #define _IOS_ATEND 4 2025-05-07T20:26:25.7493952Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:25.7494275Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.7494675Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.7495015Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:25.7495297Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:25.7495625Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:25.7496081Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:25.7496339Z #define _STDIO_H 1 2025-05-07T20:26:25.7496735Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:25.7497198Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:25.7497555Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:25.7497931Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:25.7498220Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:25.7498482Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:25.7498752Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:25.7499042Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:25.7499339Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.7499653Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:25.7499927Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.7500203Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:25.7500519Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:25.7500791Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:25.7501322Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:25.7501675Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:25.7502039Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:25.7502285Z #define __USE_XOPEN 1 2025-05-07T20:26:25.7502523Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:25.7502959Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:25.7503394Z #define __USE_XOPEN2K 1 2025-05-07T20:26:25.7503630Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:25.7503897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:25.7504188Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:25.7504453Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:25.7504971Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.7505581Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.7505869Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:25.7506222Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:25.7506604Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:25.7506988Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:25.7507372Z #define __END_NAMESPACE_C99 2025-05-07T20:26:25.7507641Z #define __glibcxx_integral_traps true 2025-05-07T20:26:25.7507924Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:25.7508173Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:25.7508431Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:25.7508700Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:25.7508945Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:25.7509237Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:25.7509534Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:25.7509894Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:25.7510278Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:25.7510554Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:25.7510821Z #define _IO_UNITBUF 020000 2025-05-07T20:26:25.7511070Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:25.7511329Z #define __FD_SETSIZE 1024 2025-05-07T20:26:25.7511582Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:25.7511849Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:25.7512190Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:25.7512545Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:25.7512853Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:25.7513167Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:25.7513483Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:25.7513752Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:25.7514051Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:25.7514386Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:25.7514677Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:25.7515117Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:25.7515402Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:25.7515671Z #define __USE_POSIX199506 1 2025-05-07T20:26:25.7515915Z #define _FEATURES_H 1 2025-05-07T20:26:25.7516153Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:25.7516547Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:25.7516959Z #define __stub_getmsg 2025-05-07T20:26:25.7517191Z #define _IO_FIXED 010000 2025-05-07T20:26:25.7517463Z #define __cpp_lib_addressof_constexpr 201603 
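Just above, the dump spells out how libstdc++ derives numeric_limits-style constants purely in the preprocessor: __glibcxx_digits_b(T,B) is B minus one sign bit when T is signed (the companion __glibcxx_signed_b(T,B) ((T)(-1) < 0) appears further down in the dump), and __glibcxx_max_b builds the all-ones-except-sign pattern in two steps so no intermediate shift overflows. A worked check under those definitions; the my_* names are local stand-ins, not the library's:

/* glibcxx_limits_demo.cu -- re-derives the __glibcxx_*_b arithmetic shown above */
#include <climits>

#define my_signed_b(T,B) ((T)(-1) < 0)
#define my_digits_b(T,B) (B - my_signed_b(T,B))
#define my_max_b(T,B)    (my_signed_b(T,B) \
    ? (((((T)1 << (my_digits_b(T,B) - 1)) - 1) << 1) + 1) : ~(T)0)
#define my_min_b(T,B)    (my_signed_b(T,B) ? -my_max_b(T,B) - 1 : (T)0)

/* For T=int, B=32: signed=1, digits=31,
   max = ((((1<<30) - 1) << 1) + 1) = 0x7fffffff -- note no UB from 1<<31. */
static_assert(my_max_b(int, 32) == INT_MAX, "max matches <climits>");
static_assert(my_min_b(int, 32) == INT_MIN, "min matches <climits>");
static_assert(my_max_b(unsigned, 32) == UINT_MAX, "unsigned: all bits set");
/* __glibcxx_digits10_b: 31 * 643 / 2136 = 9, since 643/2136 ~ log10(2) */
static_assert(my_digits_b(int, 32) * 643L / 2136 == 9, "digits10 of int");

int main() { return 0; }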
2025-05-07T20:26:25.7517767Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:25.7518037Z #define __stub_setlogin 2025-05-07T20:26:25.7518273Z #define __stub_fattach 2025-05-07T20:26:25.7518512Z #define __cplusplus 201703L 2025-05-07T20:26:25.7518774Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:25.7519052Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:25.7519306Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:25.7519588Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:25.7520072Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:25.7520596Z #define _IO_INTERNAL 010 2025-05-07T20:26:25.7520835Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:25.7521170Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.7521521Z #define __dev_t_defined 2025-05-07T20:26:25.7521752Z #define __DEPRECATED 1 2025-05-07T20:26:25.7521980Z #define __S32_TYPE int 2025-05-07T20:26:25.7522226Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:25.7522515Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:25.7522778Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:25.7523039Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:25.7523632Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:25.7524254Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:25.7524684Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:25.7525032Z #define OVERFLOW 3 2025-05-07T20:26:25.7525274Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:25.7525584Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:25.7525868Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7526199Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:25.7526526Z #define __SSE2_MATH__ 1 2025-05-07T20:26:25.7526769Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:25.7527078Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7527374Z #define _IO_STDIO_H 2025-05-07T20:26:25.7527615Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:25.7527907Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:25.7528220Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:25.7528514Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.7528821Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:25.7529079Z #define __amd64 1 2025-05-07T20:26:25.7529307Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:25.7529573Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:25.7529844Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:25.7530130Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:25.7530438Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:25.7530699Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:25.7530994Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:25.7531251Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:25.7531493Z #define __bounded 2025-05-07T20:26:25.7531716Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7532001Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:25.7532276Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:25.7532536Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:25.7532802Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7533113Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:25.7533523Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:25.7533925Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:25.7534275Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:25.7534613Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:25.7534950Z #define STA_PLL 0x0001 2025-05-07T20:26:25.7535197Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:25.7535458Z #define __GNUG__ 11 2025-05-07T20:26:25.7535687Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:25.7535959Z #define _T_WCHAR 2025-05-07T20:26:25.7536199Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:25.7536493Z #define __specialization_static 2025-05-07T20:26:25.7536793Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:25.7537107Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:25.7537371Z #define cudaArraySparse 0x40 2025-05-07T20:26:25.7537632Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:25.7537884Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:25.7538172Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:25.7538469Z #define _WCHAR_T 2025-05-07T20:26:25.7538704Z #define __cudaCDP2Free 2025-05-07T20:26:25.7539338Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:25.7540498Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:25.7540960Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:25.7541453Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:25.7541736Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:25.7541996Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:25.7542334Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.7542685Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:25.7542923Z #define __NO_CTYPE 1 2025-05-07T20:26:25.7543156Z #define __stub_bdflush 2025-05-07T20:26:25.7543519Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:25.7544086Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:25.7544403Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:25.7544678Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:25.7544959Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:25.7545261Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:25.7545563Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:25.7545907Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:25.7546248Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:25.7546538Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:25.7546926Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:25.7547383Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:25.7547733Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:25.7548021Z #define _IO_STDIO 040000 2025-05-07T20:26:25.7548345Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:25.7548743Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:25.7549071Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:25.7549363Z #define _PTRDIFF_T 2025-05-07T20:26:25.7549583Z #define _MOVE_H 1 2025-05-07T20:26:25.7549820Z #define __cpp_hex_float 201603L 2025-05-07T20:26:25.7550084Z #define ADJ_TAI 0x0080 2025-05-07T20:26:25.7550310Z #define __ptrvalue 2025-05-07T20:26:25.7550542Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:25.7550801Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:25.7551084Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:25.7551396Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:25.7551653Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:25.7551937Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:25.7552340Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:25.7552719Z #define __USE_GNU 1 2025-05-07T20:26:25.7552947Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:25.7553227Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:25.7553504Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:25.7554040Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:25.7554426Z #define WEXITED 4 2025-05-07T20:26:25.7554645Z #define _IO_NO_READS 4 2025-05-07T20:26:25.7554949Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:25.7555295Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:25.7555576Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:25.7555882Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:25.7556199Z #define __uid_t_defined 2025-05-07T20:26:25.7556454Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:25.7556744Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:25.7557016Z #define WNOHANG 1 2025-05-07T20:26:25.7557264Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:25.7557574Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:25.7557844Z #define cudaEventDefault 0x00 2025-05-07T20:26:25.7558153Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:25.7558483Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:25.7558723Z #define __x86_64 1 2025-05-07T20:26:25.7558954Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:25.7559349Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:25.7559826Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:25.7560317Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.7560753Z #define __PTRDIFF_T 2025-05-07T20:26:25.7561079Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:25.7561450Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:25.7561729Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7562022Z #define _Mlong_double_ long double 2025-05-07T20:26:25.7562306Z #define __cpp_lambdas 200907L 2025-05-07T20:26:25.7562562Z #define _IO_DEC 020 2025-05-07T20:26:25.7562795Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:25.7563161Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:25.7563457Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:25.7563744Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:25.7564009Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:25.7564304Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:25.7564632Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:25.7564908Z #define _ANSI_STDDEF_H 2025-05-07T20:26:25.7565173Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:25.7565489Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:25.7565857Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:25.7566236Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:25.7566522Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:25.7566822Z #define __cpp_template_auto 201606L 2025-05-07T20:26:25.7567182Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:25.7567550Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:25.7567828Z #define 
__key_t_defined 2025-05-07T20:26:25.7568083Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:25.7568450Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:25.7568920Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:25.7569288Z #define __GNUC_VA_LIST 2025-05-07T20:26:25.7569619Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:25.7570010Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:25.7570277Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:25.7570562Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:25.7570859Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:25.7571111Z #define __WCOREFLAG 0x80 2025-05-07T20:26:25.7571368Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:25.7571671Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:25.7571957Z #define __LP64__ 1 2025-05-07T20:26:25.7572206Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:25.7572621Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:25.7572910Z #define _IO_off64_t __off64_t 2025-05-07T20:26:25.7573174Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.7573433Z #define __time_t_defined 1 2025-05-07T20:26:25.7573688Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:25.7574038Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:25.7574401Z #define __USE_UNIX98 1 2025-05-07T20:26:25.7574646Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7574923Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:25.7575194Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:25.7575491Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:25.7575804Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:25.7576068Z #define SEEK_CUR 1 2025-05-07T20:26:25.7576295Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.7576570Z #define _ASSERT_H 1 2025-05-07T20:26:25.7577139Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:25.7577762Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:25.7578042Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:25.7578300Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:25.7578567Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:25.7578842Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:25.7579217Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.7579632Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:25.7580280Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:25.7580925Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:25.7581285Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:25.7581642Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:25.7582134Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:25.7582417Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.7582702Z #define cudaArrayDefault 0x00 2025-05-07T20:26:25.7582990Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:25.7583287Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:25.7583571Z #define TLOSS 5 2025-05-07T20:26:25.7583785Z #define __ssize_t_defined 2025-05-07T20:26:25.7584041Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:25.7584317Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:25.7584607Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:25.7584903Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:25.7585267Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:25.7585651Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:25.7585939Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:25.7586229Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:25.7586539Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:25.7586840Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:25.7587132Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:25.7587387Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:25.7587724Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:25.7588086Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:25.7588329Z #define __cdecl 2025-05-07T20:26:25.7588567Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:25.7588902Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:25.7589233Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:25.7589484Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:25.7589766Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:25.7590067Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:25.7590331Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.7590643Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:25.7590976Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:25.7591382Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:25.7591920Z #define ADJ_NANO 0x2000 2025-05-07T20:26:25.7592229Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:25.7592589Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:25.7592875Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:25.7593139Z #define __FLT_DIG__ 6 2025-05-07T20:26:25.7593492Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:25.7593883Z #define __NO_INLINE__ 1 2025-05-07T20:26:25.7594187Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.7594541Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:25.7594797Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:25.7595065Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:25.7595359Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:25.7595627Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:25.7595929Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:25.7596224Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:25.7603674Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
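Worth a gloss: "#define __isleap(year) ..." a bit further up is glibc's entire Gregorian leap-year rule in one expression: divisible by 4, except century years, unless divisible by 400. A quick check; my_isleap is a local copy of that definition:

/* isleap_demo.cu -- exercises the __isleap rule from the dump above */
#include <cstdio>

#define my_isleap(year) \
  ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0))

int main(void) {
  printf("%d %d %d %d\n",
         my_isleap(2024),   /* 1: divisible by 4, not a century */
         my_isleap(1900),   /* 0: century not divisible by 400  */
         my_isleap(2000),   /* 1: century divisible by 400      */
         my_isleap(2023));  /* 0: not divisible by 4            */
  return 0;
}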
2025-05-07T20:26:25.7604099Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:25.7604452Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:25.7604803Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:25.7605047Z #define MAX_CANON 255 2025-05-07T20:26:25.7605278Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:25.7605535Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:25.7605807Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:25.7606094Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:25.7606408Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:25.7606711Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:25.7606993Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:25.7607317Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:25.7607629Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:25.7607996Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:25.7608304Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:25.7608601Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:25.7608876Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:25.7609193Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:25.7609490Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:25.7609754Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:25.7610007Z #define _SYS_TYPES_H 1 2025-05-07T20:26:25.7610250Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:25.7610514Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:25.7610760Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:25.7610994Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:25.7611267Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:25.7611555Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:25.7611808Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:25.7612098Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:25.7612369Z #define FP_SUBNORMAL 3 2025-05-07T20:26:25.7612621Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:25.7612909Z #define _INITIALIZER_LIST 2025-05-07T20:26:25.7613159Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:25.7613401Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:25.7613676Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:25.7613965Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:25.7614219Z #define _IO_file_flags _flags 2025-05-07T20:26:25.7614475Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:25.7614724Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:25.7614997Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:25.7615271Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:25.7615538Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:25.7615911Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:25.7616304Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:25.7616612Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:25.7616882Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:25.7617136Z #define _BSD_SOURCE 1 2025-05-07T20:26:25.7617456Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:25.7618297Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:25.7619135Z #define __catch(X) catch(X) 2025-05-07T20:26:25.7619397Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:25.7619688Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.7619958Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:25.7620215Z #define __STRING(x) #x 2025-05-07T20:26:25.7620458Z #define
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:25.7620728Z #define _T_PTRDIFF_ 2025-05-07T20:26:25.7620976Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:25.7621357Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:25.7621634Z #define __unbounded 2025-05-07T20:26:25.7621873Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7622168Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:25.7622457Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7622753Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:25.7623035Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:25.7623332Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:25.7623656Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:25.7623965Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:25.7624246Z #define __managed__ __location__(managed) 2025-05-07T20:26:25.7624542Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:25.7624942Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.7625364Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:25.7625624Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:25.7625998Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:25.7626394Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:25.7626729Z #define _SYS_SIZE_T_H 2025-05-07T20:26:25.7627024Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:25.7627359Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:25.7627634Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:25.7627922Z #define _CRTIMP 2025-05-07T20:26:25.7628148Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:25.7628454Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:25.7628776Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:25.7629134Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:25.7629550Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7629870Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:25.7630148Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:25.7630434Z #define __SIZE_T__ 2025-05-07T20:26:25.7630652Z #define __stub_gtty 2025-05-07T20:26:25.7630875Z #define __pid_t_defined 2025-05-07T20:26:25.7631131Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:25.7631444Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7631761Z #define __glibcxx_function_requires(...) 
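The __W* macros scattered through this dump (__WTERMSIG(status) ((status) & 0x7f), __W_STOPCODE, WEXITSTATUS, and friends, some a little further down) encode the classic wait-status layout: the low 7 bits hold the terminating signal, zero there means a normal exit, and the exit code then sits in bits 8-15. A small decoding sketch using the public <sys/wait.h> wrappers built on those macros:

/* waitstatus_demo.cu -- decodes a status via the W* wrappers over the __W* macros above */
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid == 0)
    _exit(7);                     /* child: normal exit, code 7 */
  int status = 0;
  waitpid(pid, &status, 0);
  if (WIFEXITED(status))          /* per the dump: (status & 0x7f) == 0 */
    printf("exited, code %d\n", WEXITSTATUS(status));   /* bits 8..15 -> 7 */
  else if (WIFSIGNALED(status))
    printf("killed by signal %d\n", WTERMSIG(status));  /* status & 0x7f   */
  return 0;
}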
2025-05-07T20:26:25.7632052Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:25.7632297Z #define __need_clockid_t 2025-05-07T20:26:25.7632535Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:25.7632787Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:25.7633106Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:25.7633423Z #define _IO_HEX 0100 2025-05-07T20:26:25.7633682Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:25.7634016Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:25.7634324Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:25.7634596Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:25.7635009Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.7635449Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:25.7635755Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:25.7636062Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:25.7636257Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:25.7636364Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:25.7636448Z #define __stub_sstk 2025-05-07T20:26:25.7636543Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:25.7636706Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:25.7636787Z #define __wur 2025-05-07T20:26:25.7636909Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:25.7636998Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:25.7637081Z #define _IO_OCT 040 2025-05-07T20:26:25.7637180Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:25.7637270Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:25.7637361Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:25.7637493Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:25.7637585Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:25.7637689Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:25.7637945Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:25.7638091Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:25.7638225Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:25.7638349Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:25.7638440Z #define __off64_t_defined 2025-05-07T20:26:25.7638541Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:25.7638628Z #define __FLT128_DIG__ 33 2025-05-07T20:26:25.7638734Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:25.7638832Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:25.7638915Z #define __INT32_C(c) c 2025-05-07T20:26:25.7639013Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:25.7639116Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:25.7639213Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:25.7639303Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:25.7639395Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:25.7639491Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:25.7639629Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:25.7639822Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:25.7639920Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:25.7640024Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:25.7640345Z #define __have_pthread_attr_t 1 2025-05-07T20:26:25.7640451Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:25.7640681Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:25.7640791Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:25.7640893Z #define __cudaCDP2EventRecord 2025-05-07T20:26:25.7640992Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:25.7641078Z #define 
htole32(x) (x) 2025-05-07T20:26:25.7641329Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:25.7641463Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:25.7641564Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:25.7641726Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.7641872Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:25.7642003Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:25.7642145Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:25.7642237Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:25.7642338Z #define cudaArrayLayered 0x01 2025-05-07T20:26:25.7642511Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:25.7642621Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:25.7642716Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:25.7642822Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:25.7642903Z #define unix 1 2025-05-07T20:26:25.7642999Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:25.7643092Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:25.7643185Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:25.7643305Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:25.7643392Z #define __USE_POSIX 1 2025-05-07T20:26:25.7643484Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:25.7643625Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:25.7643870Z #define __THROWNL throw () 2025-05-07T20:26:25.7643963Z #define __cpp_rtti 199711L 2025-05-07T20:26:25.7644075Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:25.7644165Z #define __PMT(args) args 2025-05-07T20:26:25.7644279Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7644435Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:25.7644551Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:25.7644647Z #define _SIZE_T_DECLARED 2025-05-07T20:26:25.7644745Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:25.7644835Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:25.7645229Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:25.7645327Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:25.7645421Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:25.7645531Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:25.7645678Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:25.7645767Z #define _WCHAR_T_H 2025-05-07T20:26:25.7645860Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:25.7645950Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:25.7646042Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:25.7646141Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:25.7646237Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:25.7646332Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:25.7646441Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:25.7646520Z #define __ELF__ 1 2025-05-07T20:26:25.7646628Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:25.7646729Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:25.7646815Z #define STA_INS 0x0010 2025-05-07T20:26:25.7646921Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:25.7647097Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:25.7647190Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:25.7647433Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7647556Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
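Note "#define __CUDA_ARCH__ 520" in this chunk: together with __CUDACC_VER_MAJOR__ 12 / __CUDACC_VER_MINOR__ 6 / __CUDACC_VER_BUILD__ 85 elsewhere in the dump, it shows this macro listing came from nvcc 12.6.85 during a device-side compilation pass targeting compute capability 5.2. A minimal sketch of the usual way code branches on that macro; the file name and -arch flag are illustrative, not taken from this job:

/* arch_probe.cu -- illustrative; e.g. nvcc -arch=sm_52 arch_probe.cu */
#include <cstdio>
#include <cuda_runtime.h>

__host__ __device__ int pass_arch(void) {
#ifdef __CUDA_ARCH__
  return __CUDA_ARCH__;   /* device pass: 520 under -arch=sm_52 */
#else
  return 0;               /* host pass: the macro is undefined  */
#endif
}

__global__ void probe(int *out) { *out = pass_arch(); }

int main(void) {
  int *d = nullptr, h = -1;
  cudaMalloc(&d, sizeof(int));
  probe<<<1, 1>>>(d);
  cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
  printf("device: %d, host: %d\n", h, pass_arch());  /* device: 520, host: 0 */
  cudaFree(d);
  return 0;
}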
2025-05-07T20:26:25.7647673Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7647772Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:25.7647876Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:25.7647979Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:25.7648136Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.7648295Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:25.7648399Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:25.7648760Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:25.7648946Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:25.7649085Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:25.7649211Z #define __FLT_RADIX__ 2 2025-05-07T20:26:25.7649360Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:25.7649607Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:25.7649758Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:25.7649864Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:25.7649968Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:25.7650067Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:25.7650168Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:25.7650273Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:25.7650358Z #define WORD_BIT 32 2025-05-07T20:26:25.7650446Z #define _IO_USER_BUF 1 2025-05-07T20:26:25.7650538Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:25.7650641Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7650758Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:25.7650861Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:25.7650968Z #define __long_double_t long double 2025-05-07T20:26:25.7651061Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:25.7651153Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:25.7651559Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:25.7651748Z #define __k8 1 2025-05-07T20:26:25.7651945Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:25.7652119Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:25.7652235Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:25.7652342Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:25.7652441Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:25.7652544Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:25.7652643Z #define __blksize_t_defined 2025-05-07T20:26:25.7652743Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:25.7652842Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:25.7652954Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:25.7653045Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:25.7653151Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:25.7653244Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:25.7653342Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:25.7653609Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:25.7653951Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.7654058Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:25.7654153Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:25.7654235Z #define SEEK_SET 0 2025-05-07T20:26:25.7654333Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:25.7654428Z #define 
2025-05-07T20:26:25.7654620Z [... dump of predefined preprocessor macros from the CUDA/host toolchain check elided: several thousand #define lines, including __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, and CUDART_VERSION 12060 ...]
2025-05-07T20:26:25.7921848Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:27.6788527Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:27.6788921Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:27.6789237Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:27.6789560Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:27.6789901Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:27.7462990Z /usr/bin/nvidia-smi
2025-05-07T20:26:27.7468115Z + nvidia-smi
2025-05-07T20:26:27.7642953Z Wed May  7 20:26:27 2025
2025-05-07T20:26:27.7643850Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.7644858Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:27.7645865Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.7649773Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:27.7650842Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:27.7651700Z |                                         |                        |               MIG M. |
2025-05-07T20:26:27.7652378Z |=========================================+========================+======================|
2025-05-07T20:26:27.7814181Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:27.7814742Z |  0%   28C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:27.7815139Z |                                         |                        |                  N/A |
2025-05-07T20:26:27.7815552Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.7818111Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.7818722Z | Processes:                                                                              |
2025-05-07T20:26:27.7819275Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:27.7819699Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:27.7820056Z |=========================================================================================|
2025-05-07T20:26:27.7823264Z |  No running processes found                                                             |
2025-05-07T20:26:27.7823839Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.0372266Z [INSTALL] Successfully installed CUDA 12.6.3
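Note that the "CUDA Version: 12.8" in the nvidia-smi banner is the highest CUDA runtime the 570.133.07 driver supports, not what was installed; it only needs to be greater than or equal to the 12.6 toolkit that nvcc reports. A minimal sketch of how a job could assert that the installed toolkit matches the requested release (a hypothetical check written for illustration, not the actual setup_env.bash logic):

# Hypothetical assertion that the installed CUDA toolkit matches the requested release.
# Parses "Cuda compilation tools, release 12.6, V12.6.85" out of `nvcc --version`.
expected="12.6"
actual="$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')"
if [ "${actual}" != "${expected}" ]; then
  echo "[CHECK] Expected CUDA toolkit ${expected} but nvcc reports ${actual}" >&2
  exit 1
fi
echo "[CHECK] nvcc reports CUDA toolkit ${actual}"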
2025-05-07T20:26:28.0421825Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.0422392Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.0435130Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.0435486Z env:
2025-05-07T20:26:28.0435721Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.0436022Z   BUILD_ENV: build_binary
2025-05-07T20:26:28.0436272Z   BUILD_TARGET: genai
2025-05-07T20:26:28.0436507Z   BUILD_VARIANT: cuda
2025-05-07T20:26:28.0436738Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:28.0437000Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.0437307Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.0437644Z ##[endgroup]
2025-05-07T20:26:28.3846382Z ################################################################################
2025-05-07T20:26:28.3846766Z # Install PyTorch (PIP)
2025-05-07T20:26:28.3847042Z #
2025-05-07T20:26:28.3861428Z # [2025-05-07T20:26:28.385Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.3861965Z ################################################################################
2025-05-07T20:26:28.3890243Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.3945397Z Channels:
2025-05-07T20:26:29.3945752Z  - conda-forge
2025-05-07T20:26:29.3945990Z Platform: linux-64
2025-05-07T20:26:32.7775816Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.4969737Z Solving environment: done
2025-05-07T20:26:33.7126496Z ## Package Plan ##
2025-05-07T20:26:33.7126927Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:33.7127347Z   added / updated specs:
2025-05-07T20:26:33.7128027Z     - numpy
2025-05-07T20:26:33.7128311Z The following packages will be downloaded:
2025-05-07T20:26:33.7128663Z     package                    |            build
2025-05-07T20:26:33.7128993Z     ---------------------------|-----------------
2025-05-07T20:26:33.7129383Z     libblas-3.9.0              | 31_h59b9bed_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7129851Z     libcblas-3.9.0             | 31_he106b2a_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7130310Z     libgfortran-15.1.0         | h69a702a_2               34 KB  conda-forge
2025-05-07T20:26:33.7130764Z     libgfortran5-15.1.0        | hcea5267_2              1.5 MB  conda-forge
2025-05-07T20:26:33.7131227Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7131708Z     libopenblas-0.3.29         | pthreads_h94d23a6_0     5.6 MB  conda-forge
2025-05-07T20:26:33.7132169Z     numpy-2.0.2                | py39h9cb892a_1          7.6 MB  conda-forge
2025-05-07T20:26:33.7132564Z     ------------------------------------------------------------
2025-05-07T20:26:33.7132909Z                                            Total:       14.8 MB
2025-05-07T20:26:33.7133256Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:33.7133717Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:33.7134219Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:33.7134736Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:33.7135298Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:33.7135815Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:33.7136368Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:33.7137128Z   numpy         conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
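The [EXEC] [ATTEMPT 0/3] prefix above is emitted by a bounded retry wrapper that the prelude script puts around network-bound commands, re-running them on failure up to a fixed attempt budget. A minimal sketch of such a wrapper, using a hypothetical run_with_retries helper (the real implementation lives in .github/scripts/setup_env.bash and may differ):

# Hypothetical sketch of the retry wrapper suggested by the "[EXEC] [ATTEMPT i/N]" lines.
run_with_retries () {
  local max_attempts=3
  local attempt=0
  while true; do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0                        # command succeeded; stop retrying
    fi
    attempt=$((attempt + 1))
    if [ "${attempt}" -gt "${max_attempts}" ]; then
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1                        # retry budget exhausted
    fi
    sleep $((attempt * 10))           # simple linear backoff between attempts
  done
}

# Example mirroring the log line above:
#   run_with_retries conda install -n build_binary -c conda-forge --override-channels -y numpy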
2025-05-07T20:26:33.7137570Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:33.8589309Z libgfortran-15.1.0  | 34 KB  | ########## | 100%
2025-05-07T20:26:33.8666987Z libblas-3.9.0       | 16 KB  | ########## | 100%
2025-05-07T20:26:33.9861705Z libcblas-3.9.0      | 16 KB  | ########## | 100%
2025-05-07T20:26:34.0437690Z liblapack-3.9.0     | 16 KB  | ########## | 100%
2025-05-07T20:26:34.1251581Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:34.1771233Z libopenblas-0.3.29  | 5.6 MB | ########## | 100%
2025-05-07T20:26:34.1968148Z numpy-2.0.2         | 7.6 MB | ########## | 100%
2025-05-07T20:26:34.6167025Z done
2025-05-07T20:26:34.7179060Z Preparing transaction: done
2025-05-07T20:26:34.9183392Z Verifying transaction: done
2025-05-07T20:26:35.0192857Z Executing transaction: done
2025-05-07T20:26:35.1976430Z ################################################################################
2025-05-07T20:26:35.1976855Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.1977163Z #
2025-05-07T20:26:35.1992773Z # [2025-05-07T20:26:35.198Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:35.1993262Z ################################################################################
2025-05-07T20:26:35.2008354Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.3023529Z [CHECK] Network does not appear to be blocked.
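The "Prepare PIP Arguments" step that follows turns the (nightly, cuda/12.6.3) pair into the cu126 variant tag and the https://download.pytorch.org/whl/nightly/cu126/ index URL reported in the [INSTALL] lines below. A rough sketch of that mapping (assumed shape only; the actual parsing is done by __prepare_pip_arguments in setup_env.bash):

# Hypothetical reconstruction of the variant/index derivation shown in the log.
channel="nightly"               # release | test | nightly
variant_spec="cuda/12.6.3"      # <variant>/<full version>

version="${variant_spec#*/}"                                      # 12.6.3
variant_tag="cu$(echo "${version}" | cut -d. -f1-2 | tr -d '.')"  # cu126
pip_index="https://download.pytorch.org/whl/${channel}/${variant_tag}/"

echo "[INSTALL] Extracted package variant: ${variant_tag}"
echo "[INSTALL] Extracted the full PIP channel: ${pip_index}"
# conda run -n build_binary pip install --pre torch --index-url "${pip_index}"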
2025-05-07T20:26:35.3023973Z ################################################################################ 2025-05-07T20:26:35.3024455Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:35.3024852Z # 2025-05-07T20:26:35.3041257Z # [2025-05-07T20:26:35.303Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:26:35.3042083Z ################################################################################ 2025-05-07T20:26:35.3042320Z 2025-05-07T20:26:35.3062957Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:35.3089578Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:26:35.3107023Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:35.3107565Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:26:35.3116292Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:35.3124899Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:26:35.3146599Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:55.1880009Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:55.1880526Z Collecting torch 2025-05-07T20:27:55.1881174Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:27:55.1881901Z Collecting filelock (from torch) 2025-05-07T20:27:55.1882429Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:27:55.1883385Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2) 2025-05-07T20:27:55.1884094Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:27:55.1884650Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:27:55.1885521Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.6 MB/s eta 0:00:00 2025-05-07T20:27:55.1885884Z Collecting networkx (from torch) 2025-05-07T20:27:55.1886763Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-05-07T20:27:55.1887443Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1887783Z Collecting jinja2 (from torch) 2025-05-07T20:27:55.1888271Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:27:55.1888781Z Collecting fsspec (from torch) 2025-05-07T20:27:55.1889274Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:27:55.1889848Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1890567Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:27:55.1891352Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 56.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1891765Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1892524Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:27:55.1893310Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 9.8 MB/s 
eta 0:00:00 2025-05-07T20:27:55.1893710Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:27:55.1894418Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:27:55.1895200Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 36.9 MB/s eta 0:00:00 2025-05-07T20:27:55.1895585Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:27:55.1896260Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:27:55.1897032Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 36.3 MB/s eta 0:00:00 2025-05-07T20:27:55.1906178Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:27:55.1907266Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:27:55.1908182Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 51.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1908577Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:27:55.1909263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:27:55.1910036Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 146.4 MB/s eta 0:00:00 2025-05-07T20:27:55.1910423Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:27:55.1911118Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:27:55.1911890Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 196.0 MB/s eta 0:00:00 2025-05-07T20:27:55.1912316Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:27:55.1913015Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:27:55.1913785Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 150.2 MB/s eta 0:00:00 2025-05-07T20:27:55.1914175Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:27:55.1914885Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:27:55.1915649Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 129.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1916046Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:27:55.1916741Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:27:55.1917533Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 163.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1918639Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:27:55.1919404Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:27:55.1920177Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1920837Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:27:55.1921499Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:27:55.1922275Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:27:55.1923128Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 183.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1923512Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:27:55.1924318Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:27:55.1925127Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:27:55.1925947Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:27:55.1927191Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:27:55.1928036Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:27:55.1928593Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:27:55.1929227Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 48.0 MB/s eta 0:00:00 2025-05-07T20:27:55.1929603Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:27:55.1930391Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:27:55.1931418Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl (825.5 MB) 2025-05-07T20:27:55.1932211Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1932981Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:27:55.1933821Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.5 MB/s eta 0:00:00 2025-05-07T20:27:55.1934567Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:27:55.1935392Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 102.9 MB/s eta 0:00:00 2025-05-07T20:27:55.1936261Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB) 2025-05-07T20:27:55.1937119Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 134.2 MB/s eta 0:00:00 2025-05-07T20:27:55.1938812Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:27:55.1940557Z 2025-05-07T20:27:55.1942558Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 
nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:55.1944590Z 2025-05-07T20:27:57.4187780Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:57.4190207Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:00.8253501Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:04.2763425Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:04.2763897Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:07.6175631Z True 2025-05-07T20:28:07.6175924Z True 2025-05-07T20:28:07.6176040Z 2025-05-07T20:28:07.6837690Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:07.6874449Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:07.6875077Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:07.6886770Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:07.6887124Z env: 2025-05-07T20:28:07.6887351Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:07.6887651Z BUILD_ENV: build_binary 2025-05-07T20:28:07.6887903Z BUILD_TARGET: genai 2025-05-07T20:28:07.6888132Z BUILD_VARIANT: cuda 2025-05-07T20:28:07.6888369Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:07.6888625Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:07.6888929Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:07.6889267Z ##[endgroup] 2025-05-07T20:28:08.0268310Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:08.0270787Z ################################################################################ 2025-05-07T20:28:08.0271320Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:08.0271701Z # 2025-05-07T20:28:08.0288090Z # [2025-05-07T20:28:08.028Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:08.0288522Z ################################################################################ 2025-05-07T20:28:08.0288741Z 2025-05-07T20:28:08.0305196Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:08.1244672Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:08.1255473Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:08.1256099Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:08.1256511Z 2025-05-07T20:28:08.2168446Z 2025-05-07T20:28:08.2169003Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:08.2190219Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:14.2924858Z Collecting environment information... 
2025-05-07T20:28:14.2925443Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.2925813Z Is debug build: False 2025-05-07T20:28:14.2926066Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.2926354Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.2926542Z 2025-05-07T20:28:14.2926674Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.2927022Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.2927349Z Clang version: Could not collect 2025-05-07T20:28:14.2927634Z CMake version: Could not collect 2025-05-07T20:28:14.2927902Z Libc version: glibc-2.34 2025-05-07T20:28:14.2928105Z 2025-05-07T20:28:14.2928532Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.2929304Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.2929916Z Is CUDA available: True 2025-05-07T20:28:14.2930268Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.2930646Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.2931019Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.2931354Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.2931643Z cuDNN version: Could not collect 2025-05-07T20:28:14.2931921Z HIP runtime version: N/A 2025-05-07T20:28:14.2932171Z MIOpen runtime version: N/A 2025-05-07T20:28:14.2932433Z Is XNNPACK available: True 2025-05-07T20:28:14.2932597Z 2025-05-07T20:28:14.2932680Z CPU: 2025-05-07T20:28:14.2932896Z Architecture: x86_64 2025-05-07T20:28:14.2933241Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.2933640Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.2934041Z Byte Order: Little Endian 2025-05-07T20:28:14.2934359Z CPU(s): 16 2025-05-07T20:28:14.2934676Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.2935416Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.2935770Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.2936100Z CPU family: 23 2025-05-07T20:28:14.2936396Z Model: 49 2025-05-07T20:28:14.2936689Z Thread(s) per core: 2 2025-05-07T20:28:14.2936994Z Core(s) per socket: 8 2025-05-07T20:28:14.2937284Z Socket(s): 1 2025-05-07T20:28:14.2937569Z Stepping: 0 2025-05-07T20:28:14.2937890Z BogoMIPS: 5599.99 2025-05-07T20:28:14.2939987Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.2942488Z Hypervisor vendor: KVM 2025-05-07T20:28:14.2942807Z Virtualization type: full 2025-05-07T20:28:14.2943147Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.2943520Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.2943885Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.2944242Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.2944564Z NUMA node(s): 1 2025-05-07T20:28:14.2944862Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.2945206Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.2945788Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.2946155Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.2946512Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:14.2946865Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.2947231Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.2947603Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.2948154Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.2948733Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.2949284Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.2949973Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.2950824Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.2951511Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.2951879Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.2952112Z 2025-05-07T20:28:14.2952224Z Versions of relevant libraries: 2025-05-07T20:28:14.2952493Z [pip3] numpy==2.0.2 2025-05-07T20:28:14.2952744Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.2953060Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.2953376Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.2953699Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.2954021Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.2954312Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.2954613Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.2954920Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.2955236Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.2955714Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.2956027Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.2956326Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.2956653Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.2956979Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.2957288Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.2957665Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2958158Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2958676Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2959203Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2959741Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2960280Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2960781Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2961249Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.2961736Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.2962236Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.2962752Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2963215Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.2963680Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2964141Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2964621Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2965198Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.2965663Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:14.2966135Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.2966599Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2967067Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.2967538Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2968012Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2968491Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.2968978Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.2969478Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2969967Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.2970453Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2970945Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.2971420Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:14.2971880Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.2972388Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.2972899Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2973409Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2973903Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.2974483Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.2974971Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.2975460Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.2975965Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.2976475Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.2976973Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.2977458Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.2977949Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2978436Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.2978904Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.2979193Z 2025-05-07T20:28:14.3722478Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.3723168Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.3735122Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.3735478Z env: 2025-05-07T20:28:14.3735700Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.3736008Z BUILD_ENV: build_binary 2025-05-07T20:28:14.3736255Z BUILD_TARGET: genai 2025-05-07T20:28:14.3736485Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.3736719Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.3736980Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.3737289Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.3737623Z ##[endgroup] 2025-05-07T20:28:14.7143135Z ################################################################################ 2025-05-07T20:28:14.7143533Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:14.7144131Z # 2025-05-07T20:28:14.7158560Z # [2025-05-07T20:28:14.715Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:14.7158970Z ################################################################################ 2025-05-07T20:28:14.7159187Z 2025-05-07T20:28:14.7175682Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:14.8053867Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:14.8075300Z [BUILD] Running git submodules update ... 2025-05-07T20:28:14.8096404Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:14.8458123Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:14.8458599Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:14.8459048Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:14.8459460Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:14.8459872Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:14.8460344Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:14.8460748Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:14.8493737Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:14.9041012Z [BUILD] Installing other build dependencies ... 
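[NOTE] The submodule preparation above can be reproduced locally with plain git; a minimal sketch, assuming a checkout of pytorch/FBGEMM as the current directory (both commands mirror the [EXEC] lines in the log):
+ git submodule sync                               # re-point submodule URLs at the ones recorded in .gitmodules
+ git submodule update --init --recursive          # fetch asmjit, composable_kernel, cpuinfo, cutlass, googletest, hipify_torch, json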
2025-05-07T20:28:14.9062928Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.3041828Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.3154118Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:17.4176794Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:17.4215063Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:17.6792659Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:17.6836414Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:17.8034387Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:17.8070596Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.1743225Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.1782225Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.2405987Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.2410829Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.3284375Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.3319720Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.3810893Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:18.4488719Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:18.4542077Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:18.5919083Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:18.5956578Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:18.6981908Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:18.7070187Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:18.7611575Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:18.8319746Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:18.8369888Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:18.9394772Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:18.9431987Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.0602863Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.0701356Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.1873692Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.1908199Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.3002334Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:19.3039768Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:19.4593302Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.4632557Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:19.5751743Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.5790550Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.6927798Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.6961973Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.8156981Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.8192477Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:19.9209651Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.9243242Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.9861786Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.0374245Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.0409971Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.0913936Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.1447586Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.1481400Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.1981673Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.2790553Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.2826435Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:20.3932802Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.3977228Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:20.4513682Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:20.5096863Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:20.5712964Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.1959716Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 44.6 MB/s eta 0:00:00 2025-05-07T20:28:21.1997463Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.2601291Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.3187170Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.3820421Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.4452316Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.5033758Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:21.5647754Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 8.1 MB/s eta 0:00:00 2025-05-07T20:28:21.5713225Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:21.6383577Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.6950925Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:21.7552230Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:21.8150312Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:21.8753908Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:21.9271620Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:21.9881617Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.0489604Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:22.1121050Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:22.1680094Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.2276334Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.2879275Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.3482164Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.5957291Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:24.9792631Z 2025-05-07T20:28:24.9868523Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:25.1765760Z ################################################################################ 2025-05-07T20:28:25.1766316Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.1766670Z # 2025-05-07T20:28:25.1783479Z # [2025-05-07T20:28:25.178Z] + install_triton_pip build_binary 2025-05-07T20:28:25.1784018Z ################################################################################ 2025-05-07T20:28:25.1784243Z 2025-05-07T20:28:25.1784483Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.1785375Z ################################################################################ 2025-05-07T20:28:25.1785748Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.1786081Z # 2025-05-07T20:28:25.1800433Z # [2025-05-07T20:28:25.179Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.1801096Z ################################################################################ 2025-05-07T20:28:25.1801321Z 2025-05-07T20:28:25.1816110Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.2715472Z [CHECK] Network does not appear to be blocked. 
2025-05-07T20:28:25.2716137Z ################################################################################ 2025-05-07T20:28:25.2716520Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.2716801Z # 2025-05-07T20:28:25.2733298Z # [2025-05-07T20:28:25.273Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.2734031Z ################################################################################ 2025-05-07T20:28:25.2734254Z 2025-05-07T20:28:25.2783994Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.2799903Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.2800582Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.2808856Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.2818139Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.2839296Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.1472539Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:33.1473892Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:33.1474623Z 2025-05-07T20:28:33.1474840Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.1475262Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:33.1476069Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:33.1477256Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:33.1478343Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 51.5 MB/s eta 0:00:00 2025-05-07T20:28:33.1478738Z Installing collected packages: pytorch-triton 2025-05-07T20:28:33.1479107Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:33.1479504Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:33.1479934Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:33.1480368Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:33.1480814Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:33.1481088Z 2025-05-07T20:28:35.3533060Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.3537154Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.5114612Z ################################################################################ 2025-05-07T20:28:37.5115091Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.5117128Z ################################################################################ 2025-05-07T20:28:37.5117364Z 2025-05-07T20:28:39.5580146Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.7264467Z [CHECK] Python (sub-)package 'skbuild' found ... 
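[NOTE] A minimal sketch of verifying the pinned pytorch-triton version by hand, equivalent to the VERSION check above (assumes the build_binary env from earlier steps; triton.__version__ is the standard version attribute of the triton Python package):
+ conda run -n build_binary python -c "import triton; print(triton.__version__)"  # expected to print 3.2.0 after the pin above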
2025-05-07T20:28:41.7268018Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.7301944Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.7302447Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.7314248Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.7314601Z env: 2025-05-07T20:28:41.7314831Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.7315146Z BUILD_ENV: build_binary 2025-05-07T20:28:41.7315429Z BUILD_TARGET: genai 2025-05-07T20:28:41.7315659Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.7315895Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.7316147Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.7316452Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.7316792Z ##[endgroup] 2025-05-07T20:28:42.0698072Z ################################################################################ 2025-05-07T20:28:42.0698858Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:42.0699126Z # 2025-05-07T20:28:42.0713224Z # [2025-05-07T20:28:42.070Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0713870Z ################################################################################ 2025-05-07T20:28:42.0714092Z 2025-05-07T20:28:42.0714450Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0715141Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0715478Z 2025-05-07T20:28:42.0829962Z c3e6bfa6eadc59821953963216ffd62fb3371bf7 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0832508Z 2025-05-07T20:28:42.0833248Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0833615Z 2025-05-07T20:28:42.0964817Z b3c4041bb027a8c4ddf5b1fb266e05c307983525c8b62d76707c6e5028cede02 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0967028Z 2025-05-07T20:28:42.0967336Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0967686Z 2025-05-07T20:28:42.1195681Z 139b82e5ceb5b9eb6e5607b06b0c5115 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.1198306Z 2025-05-07T20:28:42.1207953Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:42.1229686Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.8592739Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.8594027Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:28:44.8595026Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.8595487Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.8595763Z 2025-05-07T20:28:51.6921489Z ################################################################################ 2025-05-07T20:28:51.6921937Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.6922321Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:51.6922771Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:51.6923100Z [CHECK] 2025-05-07T20:28:51.6923425Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:51.6923939Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:51.6924346Z ################################################################################ 2025-05-07T20:28:51.6924598Z 2025-05-07T20:28:51.6924726Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:55.5972180Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:28:59.5106693Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.4298087Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.4302830Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:15.1581628Z ################################################################################ 2025-05-07T20:29:15.1582089Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:15.1582447Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:15.1582804Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:15.1583161Z ################################################################################ 2025-05-07T20:29:15.1583394Z 2025-05-07T20:29:22.9743409Z ################################################################################ 2025-05-07T20:29:22.9744284Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:22.9745674Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:22.9747239Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:22.9747778Z ################################################################################ 2025-05-07T20:29:22.9748003Z 2025-05-07T20:29:22.9748171Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:26.8687053Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:30.7621895Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:34.7984756Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:38.6902257Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:38.6906294Z [INSTALL] Check for operator registrations ... 
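[NOTE] A minimal sketch of the kind of operator-registration check that follows, assuming the build_binary env (per the symbol listing above, importing fbgemm_gpu loads the native libraries that register the fbgemm ops under torch.ops):
+ conda run -n build_binary python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.nccl_init)"  # raises if the op is not registered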
2025-05-07T20:29:42.5194713Z fbgemm.nccl_init 2025-05-07T20:29:42.5194958Z 2025-05-07T20:29:42.5821396Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:46.4194524Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:46.4194747Z 2025-05-07T20:29:46.4823584Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.3115653Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.3115888Z 2025-05-07T20:29:50.3756189Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.3756810Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:50.3794649Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.3795129Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.3808997Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:50.3809360Z env: 2025-05-07T20:29:50.3809592Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:50.3809892Z BUILD_ENV: build_binary 2025-05-07T20:29:50.3810145Z BUILD_TARGET: genai 2025-05-07T20:29:50.3810380Z BUILD_VARIANT: cuda 2025-05-07T20:29:50.3810615Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:50.3810879Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:50.3811185Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:50.3811528Z ##[endgroup] 2025-05-07T20:29:50.7184713Z ################################################################################ 2025-05-07T20:29:50.7185237Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.7185607Z # 2025-05-07T20:29:50.7200271Z # [2025-05-07T20:29:50.719Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.7200860Z ################################################################################ 2025-05-07T20:29:50.7201192Z 2025-05-07T20:29:58.5745512Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:58.5746473Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:58.5746913Z [TEST] Determined the test directories: 2025-05-07T20:29:58.5747241Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:58.5747556Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:58.5747858Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:58.5748055Z 2025-05-07T20:29:58.5756140Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:58.5763270Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:58.5763713Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:58.5764423Z 2025-05-07T20:29:59.0029992Z 2025-05-07T20:29:59.0030275Z [TEST] Installing PyTest ... 
2025-05-07T20:29:59.0053822Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:00.1057897Z Channels: 2025-05-07T20:30:00.1058189Z - conda-forge 2025-05-07T20:30:00.1058446Z Platform: linux-64 2025-05-07T20:30:03.3976690Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:04.5514619Z Solving environment: \ | / done 2025-05-07T20:30:04.7776336Z 2025-05-07T20:30:04.7776771Z ## Package Plan ## 2025-05-07T20:30:04.7777013Z 2025-05-07T20:30:04.7777302Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:04.7777764Z 2025-05-07T20:30:04.7777897Z added / updated specs: 2025-05-07T20:30:04.7778239Z - expecttest 2025-05-07T20:30:04.7778520Z - pytest 2025-05-07T20:30:04.7778688Z 2025-05-07T20:30:04.7778694Z 2025-05-07T20:30:04.7778853Z The following packages will be downloaded: 2025-05-07T20:30:04.7779155Z 2025-05-07T20:30:04.7779278Z package | build 2025-05-07T20:30:04.7779604Z ---------------------------|----------------- 2025-05-07T20:30:04.7779982Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:04.7780450Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:04.7780918Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:04.7781448Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:04.7781886Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:04.7782314Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:04.7782732Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:04.7783474Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:04.7783882Z ------------------------------------------------------------ 2025-05-07T20:30:04.7784230Z Total: 428 KB 2025-05-07T20:30:04.7784440Z 2025-05-07T20:30:04.7784574Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:04.7784795Z 2025-05-07T20:30:04.7784997Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:04.7785507Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:04.7786033Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:04.7786515Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:04.7786986Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:04.7787442Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:04.7787884Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:04.7788312Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:04.7788575Z 2025-05-07T20:30:04.7788580Z 2025-05-07T20:30:04.7788584Z 2025-05-07T20:30:04.7788731Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:05.0344008Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%
2025-05-07T20:30:05.0456434Z colorama-0.4.6 | 26 KB | ########## | 100%
2025-05-07T20:30:05.0668396Z tomli-2.2.1 | 19 KB | ########## | 100%
2025-05-07T20:30:05.0859163Z pluggy-1.5.0 | 23 KB | ########## | 100%
2025-05-07T20:30:05.0874652Z expecttest-0.3.0 | 14 KB | ########## | 100%
2025-05-07T20:30:05.0976106Z packaging-25.0 | 61 KB | ########## | 100%
2025-05-07T20:30:05.0984141Z iniconfig-2.0.0 | 11 KB | ########## | 100%
2025-05-07T20:30:05.1236081Z pytest-8.3.5 | 254 KB | ########## | 100%
2025-05-07T20:30:05.1445423Z done
2025-05-07T20:30:05.2455902Z Preparing transaction: done
2025-05-07T20:30:05.3460568Z Verifying transaction: done
2025-05-07T20:30:07.2488020Z Executing transaction: done
2025-05-07T20:30:07.3848533Z [TEST] Checking imports ...
2025-05-07T20:30:11.2829217Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:11.2841691Z [TEST] Setting feature flags ...
2025-05-07T20:30:11.2842282Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:11.2842722Z 2025-05-07T20:30:11.7069701Z 2025-05-07T20:30:11.7070473Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:11.7071369Z ################################################################################ 2025-05-07T20:30:11.7071808Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:11.7072069Z # 2025-05-07T20:30:11.7092519Z # [2025-05-07T20:30:11.708Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:11.7093412Z ################################################################################ 2025-05-07T20:30:11.7093656Z 2025-05-07T20:30:11.7100280Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:11.7129079Z ./attention/gqa_test.py 2025-05-07T20:30:11.7129457Z ./coalesce/coalesce_test.py 2025-05-07T20:30:11.7129846Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:11.7130237Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:11.7130584Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:11.7130847Z ./moe/activation_test.py 2025-05-07T20:30:11.7131098Z ./moe/gather_scatter_test.py 2025-05-07T20:30:11.7131355Z ./moe/layers_test.py 2025-05-07T20:30:11.7131590Z ./moe/shuffling_test.py 2025-05-07T20:30:11.7131833Z ./quantize/quantize_test.py 2025-05-07T20:30:11.7132004Z 2025-05-07T20:30:11.7132122Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:11.7132342Z 2025-05-07T20:30:11.7150037Z ################################################################################ 2025-05-07T20:30:11.7164471Z # [2025-05-07T20:30:11.716Z] Run Python Test Suite: 2025-05-07T20:30:11.7164931Z # ./attention/gqa_test.py 2025-05-07T20:30:11.7165352Z ################################################################################ 2025-05-07T20:30:11.7188948Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:11.7189558Z 2025-05-07T20:30:14.2649184Z ============================= test session starts ============================== 2025-05-07T20:30:14.2649819Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:14.2650347Z cachedir: .pytest_cache 2025-05-07T20:30:14.2650928Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:14.2651915Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:14.2652367Z plugins: hypothesis-6.131.14 2025-05-07T20:30:15.7902948Z collecting ... 
collected 2 items
2025-05-07T20:30:51.8656150Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8658767Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8661063Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
2025-05-07T20:30:51.8663465Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8665703Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
2025-05-07T20:30:51.8668439Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
2025-05-07T20:30:51.8670686Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
2025-05-07T20:30:51.8672918Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
2025-05-07T20:30:51.8675202Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
2025-05-07T20:30:51.8677496Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
2025-05-07T20:30:51.8690581Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
2025-05-07T20:30:51.8692869Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
2025-05-07T20:30:51.8695115Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
2025-05-07T20:30:51.8697399Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
2025-05-07T20:30:51.8699644Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
2025-05-07T20:30:51.8701986Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
2025-05-07T20:30:51.8704349Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
2025-05-07T20:30:51.8706841Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8708687Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8710508Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8712338Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
2025-05-07T20:30:51.8714265Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
2025-05-07T20:30:51.8716148Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
2025-05-07T20:30:51.8717983Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
2025-05-07T20:30:51.8719812Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
2025-05-07T20:30:51.8721603Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
2025-05-07T20:30:51.8723490Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
2025-05-07T20:30:51.8725297Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
2025-05-07T20:30:51.8727225Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
2025-05-07T20:30:51.8729045Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
2025-05-07T20:30:51.8730830Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
2025-05-07T20:30:51.8732728Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8734524Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8736323Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
2025-05-07T20:30:51.8738150Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8739950Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
2025-05-07T20:30:51.8742286Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
2025-05-07T20:30:51.8744277Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
2025-05-07T20:30:51.8746071Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
2025-05-07T20:30:51.8747954Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:51.8749761Z PASSED
2025-05-07T20:30:51.9108230Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
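The runs of "Trying example: ..." lines above are Hypothesis's verbose-mode output: the session banner shows profile 'ci' with derandomize=True, so the same example sequence is replayed on every run. What follows is a minimal sketch of the kind of property-based test that produces this output; the strategies, bounds, and test body are illustrative assumptions, not the actual attention/gqa_test.py source.

import unittest

import hypothesis.strategies as st
from hypothesis import Verbosity, given, settings


class GQATestSketch(unittest.TestCase):
    # verbosity=Verbosity.verbose prints each "Trying example: ..." line;
    # derandomize=True (set here, or via the registered 'ci' profile) makes
    # the example sequence reproducible across CI runs.
    @given(
        int4_kv=st.booleans(),
        num_groups=st.sampled_from([1, 4]),
        B=st.integers(min_value=1, max_value=128),
        MAX_T=st.integers(min_value=4, max_value=128),
        N_H_L=st.integers(min_value=1, max_value=128),
    )
    @settings(verbosity=Verbosity.verbose, derandomize=True, deadline=None)
    def test_gqa_sketch(self, int4_kv: bool, num_groups: int, B: int, MAX_T: int, N_H_L: int) -> None:
        # A real test would build KV caches with these shapes and compare the
        # GQA kernel against a reference; a placeholder check stands in here.
        self.assertGreater(B * MAX_T * N_H_L, 0)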
2025-05-07T20:30:51.9108738Z =========================== short test summary info ============================
2025-05-07T20:30:51.9109468Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available
2025-05-07T20:30:51.9110166Z ======================== 1 passed, 1 skipped in 38.16s =========================
2025-05-07T20:30:52.5484926Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:30:52.5505251Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds
2025-05-07T20:30:52.5527252Z ################################################################################
2025-05-07T20:30:52.5547188Z # [2025-05-07T20:30:52.554Z] Run Python Test Suite:
2025-05-07T20:30:52.5547690Z # ./coalesce/coalesce_test.py
2025-05-07T20:30:52.5548091Z ################################################################################
2025-05-07T20:30:52.5571723Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:30:54.7048222Z ============================= test session starts ==============================
2025-05-07T20:30:54.7049087Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:54.7049659Z cachedir: .pytest_cache
2025-05-07T20:30:54.7050259Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:54.7051007Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:54.7051429Z plugins: hypothesis-6.131.14
2025-05-07T20:30:56.2512265Z collecting ...
collected 1 item
2025-05-07T20:30:56.9830792Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:30:56.9831305Z ============================== 1 passed in 2.41s ===============================
2025-05-07T20:30:57.6034934Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:30:57.6056729Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:30:57.6079069Z ################################################################################
2025-05-07T20:30:57.6094925Z # [2025-05-07T20:30:57.609Z] Run Python Test Suite:
2025-05-07T20:30:57.6095388Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:30:57.6095706Z ################################################################################
2025-05-07T20:30:57.6119398Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:30:59.7660435Z ============================= test session starts ==============================
2025-05-07T20:30:59.7661461Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:59.7662010Z cachedir: .pytest_cache
2025-05-07T20:30:59.7662665Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:59.7663413Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:59.7663850Z plugins: hypothesis-6.131.14
2025-05-07T20:31:01.3625014Z collecting ...
collected 5 items
2025-05-07T20:31:01.3636120Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:01.3645040Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:01.3652770Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:01.3660515Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:01.3677752Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:01.3678753Z =========================== short test summary info ============================
2025-05-07T20:31:01.3679786Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3680723Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3681652Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3682572Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3683499Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3684149Z ============================== 5 skipped in 1.73s ==============================
2025-05-07T20:31:01.9023736Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:01.9045754Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds
2025-05-07T20:31:01.9068253Z ################################################################################
2025-05-07T20:31:01.9083736Z # [2025-05-07T20:31:01.908Z] Run Python Test Suite:
2025-05-07T20:31:01.9084230Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:01.9084679Z ################################################################################
2025-05-07T20:31:01.9108877Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:04.0606238Z ============================= test session starts ==============================
2025-05-07T20:31:04.0607343Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:04.0607873Z cachedir: .pytest_cache
2025-05-07T20:31:04.0608467Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:04.0609204Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:04.0609618Z plugins: hypothesis-6.131.14
2025-05-07T20:31:05.7171123Z collecting ...
collected 2 items
2025-05-07T20:31:05.7183074Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:05.7198043Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:05.7198650Z =========================== short test summary info ============================
2025-05-07T20:31:05.7199300Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:05.7200148Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:05.7200782Z ============================== 2 skipped in 1.79s ==============================
2025-05-07T20:31:06.2551551Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:06.2572323Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds
2025-05-07T20:31:06.2595184Z ################################################################################
2025-05-07T20:31:06.2610276Z # [2025-05-07T20:31:06.260Z] Run Python Test Suite:
2025-05-07T20:31:06.2611127Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:06.2611455Z ################################################################################
2025-05-07T20:31:06.2634944Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:08.4232032Z ============================= test session starts ==============================
2025-05-07T20:31:08.4232709Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:08.4233254Z cachedir: .pytest_cache
2025-05-07T20:31:08.4233862Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:08.4234615Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:08.4235036Z plugins: hypothesis-6.131.14
2025-05-07T20:31:09.9965474Z collecting ... collected 4 items
2025-05-07T20:31:13.0394051Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
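The SKIPPED results in this suite and the surrounding ones come from hardware gates rather than failures: a linux.g5.4xlarge runner carries a single NVIDIA A10G (compute capability 8.6), so Hopper-only kernels (compute capability 9.0) and H100/MI300-only paths are skipped at collection time. Below is a minimal sketch of such a gate, assuming a hypothetical helper name; FBGEMM's actual guards live in the test sources (e.g. gather_scatter_test.py:29).

import unittest

import torch


def running_on_hopper() -> bool:
    # An H100 reports compute capability (9, 0); the A10G on this runner
    # reports (8, 6), so the guarded tests are skipped here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)


@unittest.skipIf(
    not running_on_hopper(),
    "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
)
class HopperOnlyTestSketch(unittest.TestCase):
    def test_gather_along_first_dim(self) -> None:
        ...  # body elided; the decorator is the point of this sketch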
2025-05-07T20:31:13.0558084Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:13.0750857Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:13.0911582Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:13.0912317Z =========================== short test summary info ============================
2025-05-07T20:31:13.0913150Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:13.0914081Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available
2025-05-07T20:31:13.0914997Z ============================== 4 skipped in 4.80s ==============================
2025-05-07T20:31:14.6874458Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:14.6895330Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:14.6917946Z ################################################################################
2025-05-07T20:31:14.6933876Z # [2025-05-07T20:31:14.693Z] Run Python Test Suite:
2025-05-07T20:31:14.6934352Z # ./moe/activation_test.py
2025-05-07T20:31:14.6934756Z ################################################################################
2025-05-07T20:31:14.6959508Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:16.8514162Z ============================= test session starts ==============================
2025-05-07T20:31:16.8515100Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:16.8515826Z cachedir: .pytest_cache
2025-05-07T20:31:16.8516423Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:16.8517164Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:16.8517595Z plugins: hypothesis-6.131.14
2025-05-07T20:31:18.5053563Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:18.7195883Z collecting ...
collected 2 items
2025-05-07T20:31:24.7283496Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7286119Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7288711Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7290625Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7292534Z Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7294398Z Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7296535Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7298415Z Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7300299Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7302343Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7304225Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7306108Z Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7308090Z Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7309969Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7311876Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7313782Z Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7315645Z Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7317649Z Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7319528Z Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7321389Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7323273Z Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7325124Z Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7326974Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7329061Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7330923Z Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7332788Z Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7334670Z Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7336628Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7338591Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7340769Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7342743Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7344624Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7346508Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7348384Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7350425Z Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7352293Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7354175Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7356054Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7357933Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7359980Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7361840Z PASSED
2025-05-07T20:31:24.7988871Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:24.7989986Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:24.7991385Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:24.7992885Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:24.7994297Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:24.7995713Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:24.7997387Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:24.7998807Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:24.8000253Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:24.8001529Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:24.8002777Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:24.8004011Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:31:24.8005075Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:24.8006124Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:31:24.8007424Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:24.8008731Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:24.8009872Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:24.8011081Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:31:24.8012282Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:24.8013665Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:24.8014752Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:24.8015699Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:24.8016472Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:31:24.8017522Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:24.8168067Z [identical identify_mutated_tensors warning and CompilationError traceback repeated three more times, at 20:31:24.816194, 20:31:24.859117, and 20:31:24.863476]
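These warnings point at the real problem in this suite: Triton's fp8e4nv type is the encoding behind torch.float8_e4m3fn, and compiling kernels that use it generally requires compute capability 8.9 or newer (Ada or Hopper class). The A10G on this runner is compute capability 8.6, so Triton can only offer fp8e4b15 and fp8e5, exactly as the ValueError reports, and the fp8 rowwise-quantization path below fails instead of being skipped. A minimal sketch of a preflight check that could gate such a path follows; the function name is an assumption, not an FBGEMM API.

import torch


def fp8_e4m3_supported() -> bool:
    # fp8e4nv kernels compile on compute capability (8, 9) and newer;
    # older parts only get fp8e4b15/fp8e5, which is what the error above reports.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


if not fp8_e4m3_supported():
    print("fp8e4nv unsupported on this GPU; skipping fp8 rowwise quantization path")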
2025-05-07T20:31:25.3759492Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:25.3762765Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:25.3763121Z     @given(
2025-05-07T20:31:25.3763389Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:25.3763735Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:25.3764052Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:25.3764403Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:25.3764754Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:25.3765054Z     )
2025-05-07T20:31:25.3765416Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:25.3765871Z     def test_silu_mul_quant(
2025-05-07T20:31:25.3766128Z         self,
2025-05-07T20:31:25.3766336Z         T: int,
2025-05-07T20:31:25.3766573Z         D: int,
2025-05-07T20:31:25.3766832Z         scale_ub: Optional[float],
2025-05-07T20:31:25.3767115Z         contiguous: bool,
2025-05-07T20:31:25.3767376Z         compiled: bool,
2025-05-07T20:31:25.3767617Z     ) -> None:
2025-05-07T20:31:25.3767843Z         torch.manual_seed(2025)
2025-05-07T20:31:25.3768679Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:25.3769256Z         x_sign = torch.sign(x)
2025-05-07T20:31:25.3769565Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:25.3769886Z         x = x_sign * x_clamp
2025-05-07T20:31:25.3770139Z         x0 = x[:, :D]
2025-05-07T20:31:25.3770368Z         x1 = x[:, D:]
2025-05-07T20:31:25.3770781Z         if contiguous:
2025-05-07T20:31:25.3771029Z             x0 = x0.contiguous()
2025-05-07T20:31:25.3759492Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:25.3760389Z     self=,
2025-05-07T20:31:25.3760822Z     T=1,
2025-05-07T20:31:25.3761016Z     D=5120,
2025-05-07T20:31:25.3761229Z     scale_ub=None,
2025-05-07T20:31:25.3761460Z     contiguous=True,
2025-05-07T20:31:25.3761702Z     compiled=True,
2025-05-07T20:31:25.3761918Z )
2025-05-07T20:31:25.3762258Z self = 
2025-05-07T20:31:25.3762765Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:25.3763036Z 
2025-05-07T20:31:25.3763121Z     @given(
2025-05-07T20:31:25.3763389Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:25.3763735Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:25.3764052Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:25.3764403Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:25.3764754Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:25.3765054Z     )
2025-05-07T20:31:25.3765416Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:25.3765871Z     def test_silu_mul_quant(
2025-05-07T20:31:25.3766128Z         self,
2025-05-07T20:31:25.3766336Z         T: int,
2025-05-07T20:31:25.3766573Z         D: int,
2025-05-07T20:31:25.3766832Z         scale_ub: Optional[float],
2025-05-07T20:31:25.3767115Z         contiguous: bool,
2025-05-07T20:31:25.3767376Z         compiled: bool,
2025-05-07T20:31:25.3767617Z     ) -> None:
2025-05-07T20:31:25.3767843Z         torch.manual_seed(2025)
2025-05-07T20:31:25.3768101Z 
2025-05-07T20:31:25.3768679Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:25.3769048Z 
2025-05-07T20:31:25.3769256Z         x_sign = torch.sign(x)
2025-05-07T20:31:25.3769565Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:25.3769886Z         x = x_sign * x_clamp
2025-05-07T20:31:25.3770139Z         x0 = x[:, :D]
2025-05-07T20:31:25.3770368Z         x1 = x[:, D:]
2025-05-07T20:31:25.3770584Z 
2025-05-07T20:31:25.3770781Z         if contiguous:
2025-05-07T20:31:25.3771029Z             x0 = x0.contiguous()
2025-05-07T20:31:25.3771303Z             x1 = x1.contiguous()
2025-05-07T20:31:25.3771551Z 
2025-05-07T20:31:25.3771759Z         if scale_ub is not None:
2025-05-07T20:31:25.3772045Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:25.3772393Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:25.3772719Z             )
2025-05-07T20:31:25.3772927Z         else:
2025-05-07T20:31:25.3773143Z             scale_ub_tensor = None
2025-05-07T20:31:25.3773420Z 
2025-05-07T20:31:25.3773668Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:25.3773993Z             op = silu_mul_quant
2025-05-07T20:31:25.3774263Z             if compiled:
2025-05-07T20:31:25.3774529Z                 op = torch.compile(op)
2025-05-07T20:31:25.3774834Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:25.3775128Z 
2025-05-07T20:31:25.3775337Z         y_fp8, y_scale = fn()
2025-05-07T20:31:25.3775633Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:25.3775942Z 
2025-05-07T20:31:25.3776193Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:25.3776540Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:25.3776841Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:25.3777173Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:25.3777549Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:25.3778889Z 
2025-05-07T20:31:25.3779112Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:25.3779313Z 
2025-05-07T20:31:25.3779428Z moe/activation_test.py:126: 
2025-05-07T20:31:25.3779733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:25.3780084Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:25.3780429Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:25.3781331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:25.3782103Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:25.3782677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:25.3783378Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:25.3784091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:25.3784841Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:25.3785612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:25.3786366Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:25.3787093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:25.3787742Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:25.3788363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:25.3788895Z     fn()
2025-05-07T20:31:25.3789401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:25.3790082Z     self.fn.run(
2025-05-07T20:31:25.3790569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:25.3791106Z     kernel = self.compile(
2025-05-07T20:31:25.3791666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:25.3792333Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:25.3792746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:25.3792980Z 
2025-05-07T20:31:25.3793195Z self = 
2025-05-07T20:31:25.3794297Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:25.3795698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a5ba040>}
2025-05-07T20:31:25.3797056Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:25.3798093Z context = 
2025-05-07T20:31:25.3798401Z 
2025-05-07T20:31:25.3798578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:25.3799111Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:25.3799578Z                            module_map=module_map)
2025-05-07T20:31:25.3799963Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:25.3800335Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:25.3800611Z E       ^
2025-05-07T20:31:25.3801182Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:25.3801641Z 
2025-05-07T20:31:25.3802070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
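The eager reference path fails inside FBGEMM's rowwise fp8 quantization rather than in the test logic itself. A standalone reproduction sketch, assuming only a CUDA build of fbgemm_gpu (the import path and the None scale_ub argument are taken from the traceback and test source above; the tensor shape is arbitrary):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # On a GPU without fp8e4nv support (SM < 8.9) this raises the same
    # triton.compiler.errors.CompilationError, since the kernel's fp8 output
    # type maps to Triton's fp8e4nv per the error above.
    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)

The error is therefore independent of the hypothesis parameters; T, D, scale_ub, contiguous, and compiled only change where it is first triggered.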
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.9703553Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.9704973Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.9706225Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.9707457Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.9708665Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.9709704Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.9710723Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.9711950Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.9713412Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.9714530Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.9715574Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.9716792Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.9718163Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.9719236Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.9720150Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.9720899Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.9721931Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.1779947Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.1781816Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.1783186Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.1784612Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.1785998Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.1787403Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.1788726Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.1790102Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.1791513Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.1792759Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.1793996Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.1795398Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.1796441Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.1797464Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.1798698Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.1800000Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.1801125Z W0507 
20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.1802179Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.1803361Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.1804730Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.1805869Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.1806801Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.1807555Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.1808574Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.7408884Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.7410092Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.7411470Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.7412990Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.7414381Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.7415785Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.7417598Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.7418987Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.7420423Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.7421800Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.7423041Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.7424267Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.7425308Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.7426331Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.7427569Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.7429008Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.7430138Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.7431182Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.7432371Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.7433719Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.7434804Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.7435725Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.7436475Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.7437510Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.7801737Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.7803233Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.7804736Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.7806352Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.7807744Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.7809133Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.7810461Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.7811850Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.7813281Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.7814527Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.7815864Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.7817084Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.7818123Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.7819149Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.7820385Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.7821763Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.7822891Z W0507 
2025-05-07T20:31:27.5399147Z self = 
2025-05-07T20:31:27.5399919Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:27.5400316Z 
2025-05-07T20:31:27.5400430Z     @given(
2025-05-07T20:31:27.5400774Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.5401215Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.5401562Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.5401921Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.5402266Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.5402566Z     )
2025-05-07T20:31:27.5402921Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.5403374Z     def test_silu_mul_quant(
2025-05-07T20:31:27.5403625Z         self,
2025-05-07T20:31:27.5403823Z         T: int,
2025-05-07T20:31:27.5404028Z         D: int,
2025-05-07T20:31:27.5404257Z         scale_ub: Optional[float],
2025-05-07T20:31:27.5404531Z         contiguous: bool,
2025-05-07T20:31:27.5404780Z         compiled: bool,
2025-05-07T20:31:27.5405026Z     ) -> None:
2025-05-07T20:31:27.5405243Z         torch.manual_seed(2025)
2025-05-07T20:31:27.5405560Z 
2025-05-07T20:31:27.5405957Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.5406444Z 
2025-05-07T20:31:27.5406645Z         x_sign = torch.sign(x)
2025-05-07T20:31:27.5407408Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.5407924Z         x = x_sign * x_clamp
2025-05-07T20:31:27.5408285Z         x0 = x[:, :D]
2025-05-07T20:31:27.5408602Z         x1 = x[:, D:]
2025-05-07T20:31:27.5408825Z 
2025-05-07T20:31:27.5409021Z         if contiguous:
2025-05-07T20:31:27.5409264Z             x0 = x0.contiguous()
2025-05-07T20:31:27.5409525Z             x1 = x1.contiguous()
2025-05-07T20:31:27.5409775Z 
2025-05-07T20:31:27.5409975Z         if scale_ub is not None:
2025-05-07T20:31:27.5410251Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.5410596Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.5410916Z             )
2025-05-07T20:31:27.5411114Z         else:
2025-05-07T20:31:27.5411335Z             scale_ub_tensor = None
2025-05-07T20:31:27.5411601Z 
2025-05-07T20:31:27.5411847Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5412184Z             op = silu_mul_quant
2025-05-07T20:31:27.5412444Z             if compiled:
2025-05-07T20:31:27.5412705Z                 op = torch.compile(op)
2025-05-07T20:31:27.5413015Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5413291Z 
2025-05-07T20:31:27.5413491Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:27.5413659Z 
2025-05-07T20:31:27.5413770Z moe/activation_test.py:117: 
2025-05-07T20:31:27.5414066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5414408Z moe/activation_test.py:115: in fn
2025-05-07T20:31:27.5414701Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5415399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:27.5416090Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:27.5416645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.5417527Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.5418192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.5418743Z     kernel = self.compile(
2025-05-07T20:31:27.5419289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.5419948Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.5420347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5420587Z 
2025-05-07T20:31:27.5420794Z self = 
2025-05-07T20:31:27.5421985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.5423387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a60f9d0>}
2025-05-07T20:31:27.5424714Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.5425730Z context = 
2025-05-07T20:31:27.5426030Z 
2025-05-07T20:31:27.5426203Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.5426731Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.5427235Z                            module_map=module_map)
2025-05-07T20:31:27.5427627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.5428076Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.5428347Z E       ^
2025-05-07T20:31:27.5428811Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.5429266Z 
2025-05-07T20:31:27.5429684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
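Every drawn example fails the same way, so hypothesis keeps cycling through its sample grid until max_examples is exhausted. One way to avoid burning CI time on unsupported GPUs is to gate the test class on compute capability; a sketch, assuming unittest-style tests as the log's class name suggests (the decorator placement and helper are illustrative, not the repository's actual code):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv requires compute capability 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "GPU lacks fp8e4nv support (needs SM 8.9+)")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as listed above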
2025-05-07T20:31:27.5430194Z 
2025-05-07T20:31:27.5430305Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.5430725Z     self=,
2025-05-07T20:31:27.5431126Z     T=2048,
2025-05-07T20:31:27.5431322Z     D=5120,
2025-05-07T20:31:27.5431522Z     scale_ub=1200.0,
2025-05-07T20:31:27.5431749Z     contiguous=True,
2025-05-07T20:31:27.5431978Z     compiled=True,
2025-05-07T20:31:27.5432197Z )
2025-05-07T20:31:27.5432518Z self = 
2025-05-07T20:31:27.5433024Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:27.5433302Z 
2025-05-07T20:31:27.5433389Z     @given(
2025-05-07T20:31:27.5433621Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.5433946Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.5434262Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.5434598Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.5434929Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.5435223Z     )
2025-05-07T20:31:27.5435585Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.5436030Z     def test_silu_mul_quant(
2025-05-07T20:31:27.5436283Z         self,
2025-05-07T20:31:27.5436489Z         T: int,
2025-05-07T20:31:27.5436688Z         D: int,
2025-05-07T20:31:27.5436922Z         scale_ub: Optional[float],
2025-05-07T20:31:27.5437208Z         contiguous: bool,
2025-05-07T20:31:27.5437453Z         compiled: bool,
2025-05-07T20:31:27.5437828Z     ) -> None:
2025-05-07T20:31:27.5438055Z         torch.manual_seed(2025)
2025-05-07T20:31:27.5438306Z 
2025-05-07T20:31:27.5438588Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.5438940Z 
2025-05-07T20:31:27.5439139Z         x_sign = torch.sign(x)
2025-05-07T20:31:27.5439442Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.5439762Z         x = x_sign * x_clamp
2025-05-07T20:31:27.5440014Z         x0 = x[:, :D]
2025-05-07T20:31:27.5440672Z         x1 = x[:, D:]
2025-05-07T20:31:27.5440972Z 
2025-05-07T20:31:27.5441232Z         if contiguous:
2025-05-07T20:31:27.5441494Z             x0 = x0.contiguous()
2025-05-07T20:31:27.5441761Z             x1 = x1.contiguous()
2025-05-07T20:31:27.5442006Z 
2025-05-07T20:31:27.5442195Z         if scale_ub is not None:
2025-05-07T20:31:27.5442474Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.5442821Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.5443137Z             )
2025-05-07T20:31:27.5443356Z         else:
2025-05-07T20:31:27.5443574Z             scale_ub_tensor = None
2025-05-07T20:31:27.5443830Z 
2025-05-07T20:31:27.5444066Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5444389Z             op = silu_mul_quant
2025-05-07T20:31:27.5444641Z             if compiled:
2025-05-07T20:31:27.5444895Z                 op = torch.compile(op)
2025-05-07T20:31:27.5445202Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5445478Z 
2025-05-07T20:31:27.5445679Z         y_fp8, y_scale = fn()
2025-05-07T20:31:27.5445971Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:27.5446261Z 
2025-05-07T20:31:27.5446504Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5446847Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:27.5447140Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:27.5456593Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:27.5456991Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.5457319Z 
2025-05-07T20:31:27.5457526Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:27.5457741Z 
2025-05-07T20:31:27.5457847Z moe/activation_test.py:126: 
2025-05-07T20:31:27.5458156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5458496Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:27.5458842Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.5459651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:27.5460411Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:27.5460961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.5461752Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.5462459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:27.5463180Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.5463938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:27.5464700Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.5465432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:27.5466069Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:27.5466686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:27.5467353Z     fn()
2025-05-07T20:31:27.5467862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:27.5468436Z     self.fn.run(
2025-05-07T20:31:27.5468912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.5469457Z     kernel = self.compile(
2025-05-07T20:31:27.5469996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.5470655Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.5471073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5471309Z 
2025-05-07T20:31:27.5471524Z self = 
2025-05-07T20:31:27.5472600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.5473993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a693a60>}
2025-05-07T20:31:27.5475348Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.5476370Z context = 
2025-05-07T20:31:27.5476658Z 
2025-05-07T20:31:27.5476836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.5477367Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.5477852Z                            module_map=module_map)
2025-05-07T20:31:27.5478313Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.5478678Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:27.5478952Z E       ^
2025-05-07T20:31:27.5479424Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.5479872Z 
2025-05-07T20:31:27.5480298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
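Note the pattern across examples: with compiled=False the failure surfaces immediately in fn() inside silu_mul_quant, while with compiled=True it surfaces later, in ref_fn()'s direct Triton call; either way the root cause is the same ValueError. The reference math before quantization is ordinary PyTorch (SiLU times a gate) and runs on any device; only the fp8 rowwise quantization needs the newer architecture. Extracted as a standalone sketch mirroring ref_fn above (the helper name is hypothetical):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # x * sigmoid(x) is SiLU, so this computes silu(x0) * x1 in fp32,
        # exactly the pre-quantization value that ref_fn feeds to
        # triton_quantize_fp8_row.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32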
2025-05-07T20:31:27.5480816Z 
2025-05-07T20:31:27.5480922Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.5481341Z     self=,
2025-05-07T20:31:27.5481747Z     T=16384,
2025-05-07T20:31:27.5481941Z     D=7168,
2025-05-07T20:31:27.5482141Z     scale_ub=1200.0,
2025-05-07T20:31:27.5482373Z     contiguous=False,
2025-05-07T20:31:27.5482601Z     compiled=False,
2025-05-07T20:31:27.5482817Z )
2025-05-07T20:31:27.9704478Z W0507 20:31:27.966295 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.1320504Z W0507 20:31:28.128102 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.6265297Z W0507 20:31:28.622463 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.6661927Z W0507 20:31:28.662323 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.1524381Z self = 
2025-05-07T20:31:30.1524950Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:30.1525246Z 
2025-05-07T20:31:30.1525333Z     @given(
2025-05-07T20:31:30.1525625Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1525955Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1526269Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1526617Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1526963Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1527258Z     )
2025-05-07T20:31:30.1527621Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1528075Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1528333Z         self,
2025-05-07T20:31:30.1528535Z         T: int,
2025-05-07T20:31:30.1528743Z         D: int,
2025-05-07T20:31:30.1528977Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1529253Z         contiguous: bool,
2025-05-07T20:31:30.1529504Z         compiled: bool,
2025-05-07T20:31:30.1529745Z     ) -> None:
2025-05-07T20:31:30.1529968Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1530224Z 
2025-05-07T20:31:30.1530965Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1531328Z 
2025-05-07T20:31:30.1531535Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1531836Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1532161Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1532418Z         x0 = x[:, :D]
2025-05-07T20:31:30.1532641Z         x1 = x[:, D:]
2025-05-07T20:31:30.1532866Z 
2025-05-07T20:31:30.1533069Z         if contiguous:
2025-05-07T20:31:30.1533311Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1533586Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1533842Z 
2025-05-07T20:31:30.1534040Z         if scale_ub is not None:
2025-05-07T20:31:30.1534328Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1534683Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1535000Z             )
2025-05-07T20:31:30.1535212Z         else:
2025-05-07T20:31:30.1535457Z             scale_ub_tensor = None
2025-05-07T20:31:30.1524381Z self = 
2025-05-07T20:31:30.1524950Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:30.1525246Z 
2025-05-07T20:31:30.1525333Z     @given(
2025-05-07T20:31:30.1525625Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1525955Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1526269Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1526617Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1526963Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1527258Z     )
2025-05-07T20:31:30.1527621Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1528075Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1528333Z         self,
2025-05-07T20:31:30.1528535Z         T: int,
2025-05-07T20:31:30.1528743Z         D: int,
2025-05-07T20:31:30.1528977Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1529253Z         contiguous: bool,
2025-05-07T20:31:30.1529504Z         compiled: bool,
2025-05-07T20:31:30.1529745Z     ) -> None:
2025-05-07T20:31:30.1529968Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1530224Z 
2025-05-07T20:31:30.1530965Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1531328Z 
2025-05-07T20:31:30.1531535Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1531836Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1532161Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1532418Z         x0 = x[:, :D]
2025-05-07T20:31:30.1532641Z         x1 = x[:, D:]
2025-05-07T20:31:30.1532866Z 
2025-05-07T20:31:30.1533069Z         if contiguous:
2025-05-07T20:31:30.1533311Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1533586Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1533842Z 
2025-05-07T20:31:30.1534040Z         if scale_ub is not None:
2025-05-07T20:31:30.1534328Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1534683Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1535000Z             )
2025-05-07T20:31:30.1535212Z         else:
2025-05-07T20:31:30.1535457Z             scale_ub_tensor = None
2025-05-07T20:31:30.1535726Z 
2025-05-07T20:31:30.1535970Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1536304Z             op = silu_mul_quant
2025-05-07T20:31:30.1536573Z             if compiled:
2025-05-07T20:31:30.1536832Z                 op = torch.compile(op)
2025-05-07T20:31:30.1537142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1537432Z 
2025-05-07T20:31:30.1537635Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:30.1537833Z 
2025-05-07T20:31:30.1537951Z moe/activation_test.py:117: 
2025-05-07T20:31:30.1538283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1538621Z moe/activation_test.py:115: in fn
2025-05-07T20:31:30.1538917Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1539622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:30.1540743Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:30.1541372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:30.1542069Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:30.1542739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:30.1543274Z     kernel = self.compile(
2025-05-07T20:31:30.1543829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:30.1544494Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.1544902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1545136Z 
2025-05-07T20:31:30.1545347Z self = 
2025-05-07T20:31:30.1546443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:30.1547852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317ad33700>}
2025-05-07T20:31:30.1549198Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:30.1550222Z context = 
2025-05-07T20:31:30.1550514Z 
2025-05-07T20:31:30.1550685Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:30.1551219Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.1551828Z                            module_map=module_map)
2025-05-07T20:31:30.1552203Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.1552575Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:30.1552844Z E       ^
2025-05-07T20:31:30.1553318Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.1553779Z 
2025-05-07T20:31:30.1554199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:30.1554733Z 
2025-05-07T20:31:30.1554841Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:30.1555265Z     self=,
2025-05-07T20:31:30.1555680Z     T=1,
2025-05-07T20:31:30.1555871Z     D=7168,
2025-05-07T20:31:30.1556076Z     scale_ub=None,
2025-05-07T20:31:30.1556301Z     contiguous=True,
2025-05-07T20:31:30.1556529Z     compiled=True,
2025-05-07T20:31:30.1556762Z )
2025-05-07T20:31:30.1557096Z self = 
2025-05-07T20:31:30.1557584Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:30.1557854Z 
2025-05-07T20:31:30.1557934Z     @given(
2025-05-07T20:31:30.1558178Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1558494Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1558813Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1559158Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1559494Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1559799Z     )
2025-05-07T20:31:30.1560159Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1560616Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1560862Z         self,
2025-05-07T20:31:30.1561071Z         T: int,
2025-05-07T20:31:30.1561286Z         D: int,
2025-05-07T20:31:30.1561591Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1561878Z         contiguous: bool,
2025-05-07T20:31:30.1562127Z         compiled: bool,
2025-05-07T20:31:30.1562355Z     ) -> None:
2025-05-07T20:31:30.1562584Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1562839Z 
2025-05-07T20:31:30.1563114Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1563472Z 
2025-05-07T20:31:30.1563681Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1563976Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1564301Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1564555Z         x0 = x[:, :D]
2025-05-07T20:31:30.1564778Z         x1 = x[:, D:]
2025-05-07T20:31:30.1564996Z 
2025-05-07T20:31:30.1565193Z         if contiguous:
2025-05-07T20:31:30.1565434Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1565696Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1565949Z 
2025-05-07T20:31:30.1566161Z         if scale_ub is not None:
2025-05-07T20:31:30.1566441Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1566787Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1567106Z             )
2025-05-07T20:31:30.1567302Z         else:
2025-05-07T20:31:30.1567525Z             scale_ub_tensor = None
2025-05-07T20:31:30.1567784Z 
2025-05-07T20:31:30.1568017Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1568348Z             op = silu_mul_quant
2025-05-07T20:31:30.1568654Z             if compiled:
2025-05-07T20:31:30.1568912Z                 op = torch.compile(op)
2025-05-07T20:31:30.1569224Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1569512Z 
2025-05-07T20:31:30.1569707Z         y_fp8, y_scale = fn()
2025-05-07T20:31:30.1570008Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:30.1570313Z 
2025-05-07T20:31:30.1570563Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1570992Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:30.1571301Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:30.1571627Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:30.1571994Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:30.1572313Z 
2025-05-07T20:31:30.1572528Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:30.1572727Z 
2025-05-07T20:31:30.1572837Z moe/activation_test.py:126: 
2025-05-07T20:31:30.1573138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1573490Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:30.1573830Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:30.1574619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:30.1575391Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:30.1575965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:30.1576664Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:30.1577355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:30.1578086Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:30.1578850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:30.1579607Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:30.1580336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:30.1581061Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:30.1581757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:30.1582281Z     fn()
2025-05-07T20:31:30.1582803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:30.1583391Z     self.fn.run(
2025-05-07T20:31:30.1583872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:30.1584402Z     kernel = self.compile(
2025-05-07T20:31:30.1584960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:30.1585619Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.1586023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1586262Z 
2025-05-07T20:31:30.1586482Z self = 
2025-05-07T20:31:30.1587576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:30.1588949Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317b531ee0>}
2025-05-07T20:31:30.1590299Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:30.1591322Z context = 
2025-05-07T20:31:30.1591621Z 
2025-05-07T20:31:30.1591794Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:30.1592336Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.1592902Z                            module_map=module_map)
2025-05-07T20:31:30.1593272Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.1593640Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:30.1593917Z E       ^
2025-05-07T20:31:30.1594382Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.1594854Z 
2025-05-07T20:31:30.1595273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
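The failure above is notable because it is the test's own reference path, not the fused kernel: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose _kernel_quantize_fp8_row trips over the very same fp8e4nv cast. For readers without the FBGEMM source at hand, here is a rough eager-mode sketch of what that rowwise quantization plausibly computes. Per-row max-abs scaling with an optional scale_ub clamp is an assumption consistent with the test's dequant check y_fp8.to(torch.float32) * y_scale[:, None], not code taken from the kernel, and it needs a PyTorch build with the float8 dtypes:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its largest |value| lands at the fp8 maximum,
        # optionally clamping the row max to scale_ub first.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = FP8_E4M3_MAX / torch.clamp(row_max, min=1e-12)
        y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
        # Return the inverse scale so dequantization is
        # y_fp8.to(torch.float32) * y_scale[:, None], as in the test.
        return y_fp8, scale.reciprocal()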
2025-05-07T20:31:30.1595807Z 
2025-05-07T20:31:30.1595913Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:30.1596333Z     self=,
2025-05-07T20:31:30.1596736Z     T=4096,
2025-05-07T20:31:30.1596933Z     D=5120,
2025-05-07T20:31:30.1597132Z     scale_ub=None,
2025-05-07T20:31:30.1597363Z     contiguous=False,
2025-05-07T20:31:30.1597600Z     compiled=False,
2025-05-07T20:31:30.1597813Z )
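Note how the drawn parameters never matter: every example dies inside make_ir, while the kernel is still being lowered to TTIR and before any tensor shapes come into play. If a standalone reproducer is wanted, something this small should hit the same CompilationError on an SM 8.6 GPU (an untested sketch; the kernel is illustrative, not FBGEMM's):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _store_to_fp8(y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        vals = tl.full((BLOCK,), 0.5, tl.float32)
        # Storing fp32 values through an fp8e4nv pointer forces the cast
        # that the architecture check rejects.
        tl.store(y_ptr + offs, vals)

    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _store_to_fp8[(1,)](y, BLOCK=16)  # raises triton.compiler.errors.CompilationError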
2025-05-07T20:31:35.7915960Z 
2025-05-07T20:31:35.7916069Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.7916494Z     self=,
2025-05-07T20:31:35.7916905Z     T=4096,
2025-05-07T20:31:35.7917107Z     D=7168,
2025-05-07T20:31:35.7917310Z     scale_ub=None,
2025-05-07T20:31:35.7917539Z     contiguous=False,
2025-05-07T20:31:35.7917772Z     compiled=False,
2025-05-07T20:31:35.7917994Z )
2025-05-07T20:31:35.7947780Z 
2025-05-07T20:31:35.7947888Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.7948319Z     self=,
2025-05-07T20:31:35.7948737Z     T=128,
2025-05-07T20:31:35.7948928Z     D=7168,
2025-05-07T20:31:35.7949139Z     scale_ub=None,
2025-05-07T20:31:35.7949366Z     contiguous=False,
2025-05-07T20:31:35.7949624Z     compiled=True,
2025-05-07T20:31:35.7949863Z )
2025-05-07T20:31:35.8815998Z 
2025-05-07T20:31:35.8816106Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.8816535Z     self=,
2025-05-07T20:31:35.8816949Z     T=128,
2025-05-07T20:31:35.8817139Z     D=7168,
2025-05-07T20:31:35.8817346Z     scale_ub=None,
2025-05-07T20:31:35.8817575Z     contiguous=False,
2025-05-07T20:31:35.8817809Z     compiled=False,
2025-05-07T20:31:35.8818032Z )
2025-05-07T20:31:36.1375745Z 
2025-05-07T20:31:36.1375857Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.1376298Z     self=,
2025-05-07T20:31:36.1376867Z     T=4096,
2025-05-07T20:31:36.1377138Z     D=5120,
2025-05-07T20:31:36.1377417Z     scale_ub=1200.0,
2025-05-07T20:31:36.1377730Z     contiguous=True,
2025-05-07T20:31:36.1378018Z     compiled=False,
2025-05-07T20:31:36.1378243Z )
2025-05-07T20:31:36.1407785Z 
2025-05-07T20:31:36.1407890Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.1408389Z     self=,
2025-05-07T20:31:36.1408803Z     T=1,
2025-05-07T20:31:36.1408986Z     D=5120,
2025-05-07T20:31:36.1409184Z     scale_ub=None,
2025-05-07T20:31:36.1409403Z     contiguous=True,
2025-05-07T20:31:36.1409626Z     compiled=True,
2025-05-07T20:31:36.1409841Z )
2025-05-07T20:31:36.6748815Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:36.6750264Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:36.6751628Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:36.6753106Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:36.6754512Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:36.6755892Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6757200Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:36.6758583Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6760462Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:36.6761721Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:31:36.6762952Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:36.6764174Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:31:36.6765232Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:36.6766277Z W0507 20:31:36.670769 86812
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:36.6767515Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:36.6768812Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:36.6769936Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:36.6771143Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:36.6772349Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:36.6773711Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:36.6774789Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6775708Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6776462Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:36.6777510Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7610169Z self = 2025-05-07T20:31:37.7611159Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.7622273Z 2025-05-07T20:31:37.7622520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7622851Z op = silu_mul_quant 2025-05-07T20:31:37.7623108Z if compiled: 2025-05-07T20:31:37.7623371Z op = torch.compile(op) 2025-05-07T20:31:37.7623680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7623961Z 2025-05-07T20:31:37.7624165Z y_fp8, y_scale = fn() 2025-05-07T20:31:37.7624464Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:37.7624761Z 2025-05-07T20:31:37.7625011Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7625367Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:37.7625678Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:37.7625999Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:37.7626370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7626702Z 2025-05-07T20:31:37.7626919Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:37.7627119Z 2025-05-07T20:31:37.7627228Z moe/activation_test.py:126: 2025-05-07T20:31:37.7627543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7627895Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:37.7628237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7629036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:37.7630157Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:37.7630751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7631449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7632155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:37.7632891Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7633656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:37.7634406Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7635144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:37.7635801Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:37.7636430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:37.7636952Z fn() 2025-05-07T20:31:37.7637468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:37.7638055Z self.fn.run( 2025-05-07T20:31:37.7638530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7639069Z kernel = self.compile( 2025-05-07T20:31:37.7639620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7640691Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7641102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7641346Z 2025-05-07T20:31:37.7641566Z self = 2025-05-07T20:31:37.7642805Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7644207Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313d565280>} 2025-05-07T20:31:37.7645544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7646576Z context = 2025-05-07T20:31:37.7646878Z 2025-05-07T20:31:37.7647052Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7647594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7648069Z module_map=module_map) 2025-05-07T20:31:37.7648453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7648828Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:37.7649110Z E ^ 2025-05-07T20:31:37.7649578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7650038Z 2025-05-07T20:31:37.7650461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7650980Z 2025-05-07T20:31:37.7651097Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7651513Z self=, 2025-05-07T20:31:37.7651927Z T=2048, 2025-05-07T20:31:37.7652128Z D=5120, 2025-05-07T20:31:37.7652335Z scale_ub=None, 2025-05-07T20:31:37.7652675Z contiguous=True, 2025-05-07T20:31:37.7652915Z compiled=True, 2025-05-07T20:31:37.7653133Z ) 2025-05-07T20:31:39.5039748Z self = 2025-05-07T20:31:39.5041225Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:39.5051811Z 2025-05-07T20:31:39.5052046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5052371Z op = silu_mul_quant 2025-05-07T20:31:39.5052636Z if compiled: 2025-05-07T20:31:39.5052888Z op = torch.compile(op) 2025-05-07T20:31:39.5053196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5053483Z 2025-05-07T20:31:39.5053687Z y_fp8, y_scale = fn() 2025-05-07T20:31:39.5053974Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:39.5054276Z 2025-05-07T20:31:39.5054521Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5054860Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:39.5055164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:39.5055503Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:39.5055867Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.5056185Z 2025-05-07T20:31:39.5056396Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:39.5056593Z 2025-05-07T20:31:39.5056699Z moe/activation_test.py:126: 2025-05-07T20:31:39.5057011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5057359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:39.5057693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.5058482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:39.5059243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:39.5059801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5060663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5061516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:39.5062250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.5063009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:39.5063754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.5064484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:39.5065135Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:39.5065749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:39.5066279Z fn() 2025-05-07T20:31:39.5066798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:39.5067385Z self.fn.run( 2025-05-07T20:31:39.5067851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5068387Z kernel = self.compile( 2025-05-07T20:31:39.5068941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5069600Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5070001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5070245Z 2025-05-07T20:31:39.5070457Z self = 2025-05-07T20:31:39.5071590Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:39.5073065Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f2fd947f0d0>}
2025-05-07T20:31:39.5074407Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:39.5075422Z context = <...>
2025-05-07T20:31:39.5075718Z 
2025-05-07T20:31:39.5075890Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:39.5076420Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:39.5076888Z                            module_map=module_map)
2025-05-07T20:31:39.5077496Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.5077860Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:39.5078134Z E       ^
2025-05-07T20:31:39.5078602Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:39.5079054Z 
2025-05-07T20:31:39.5079471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:39.5079982Z 
2025-05-07T20:31:39.5080092Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:39.5080513Z     self=<...>,
2025-05-07T20:31:39.5080915Z     T=128,
2025-05-07T20:31:39.5081108Z     D=5120,
2025-05-07T20:31:39.5081310Z     scale_ub=None,
2025-05-07T20:31:39.5081527Z     contiguous=True,
2025-05-07T20:31:39.5081762Z     compiled=True,
2025-05-07T20:31:39.5081981Z )
2025-05-07T20:31:40.0378415Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:40.0379532Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last):
2025-05-07T20:31:40.0380880Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:40.0382436Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:40.0383850Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:40.0385250Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.0386568Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:40.0387973Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.0389407Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:40.0390825Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     generator.visit(fn.parse())
2025-05-07T20:31:40.0392054Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:40.0393270Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ret = super().visit(node)
2025-05-07T20:31:40.0394326Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:40.0395363Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return visitor(node)
2025-05-07T20:31:40.0396614Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:40.0397906Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:40.0399021Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:40.0400072Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     self.visit(item)
2025-05-07T20:31:40.0401265Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:40.0402799Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:40.0403870Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.0404788Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:40.0405537Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^
2025-05-07T20:31:40.0406572Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning and traceback above repeat verbatim three more times (20:31:40.224560, 20:31:40.737247, 20:31:40.777088).]
2025-05-07T20:31:41.2379497Z self = <...>
2025-05-07T20:31:41.2380241Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:41.2380626Z 
2025-05-07T20:31:41.2380747Z     @given(
2025-05-07T20:31:41.2381609Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:41.2382041Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:41.2382389Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:41.2382737Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:41.2383077Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:41.2383380Z     )
2025-05-07T20:31:41.2383747Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:41.2384215Z     def test_silu_mul_quant(
2025-05-07T20:31:41.2384466Z         self,
2025-05-07T20:31:41.2384673Z         T: int,
2025-05-07T20:31:41.2384884Z         D: int,
2025-05-07T20:31:41.2385108Z         scale_ub: Optional[float],
2025-05-07T20:31:41.2385393Z         contiguous: bool,
2025-05-07T20:31:41.2385646Z         compiled: bool,
2025-05-07T20:31:41.2385880Z     ) -> None:
2025-05-07T20:31:41.2386113Z         torch.manual_seed(2025)
2025-05-07T20:31:41.2386375Z 
2025-05-07T20:31:41.2386661Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:41.2387020Z 
2025-05-07T20:31:41.2387227Z         x_sign = torch.sign(x)
2025-05-07T20:31:41.2387551Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2387870Z         x = x_sign * x_clamp
2025-05-07T20:31:41.2388123Z         x0 = x[:, :D]
2025-05-07T20:31:41.2388351Z         x1 = x[:, D:]
2025-05-07T20:31:41.2388562Z 
2025-05-07T20:31:41.2388760Z         if contiguous:
2025-05-07T20:31:41.2389006Z             x0 = x0.contiguous()
2025-05-07T20:31:41.2389275Z             x1 = x1.contiguous()
2025-05-07T20:31:41.2389532Z 
2025-05-07T20:31:41.2389735Z         if scale_ub is not None:
2025-05-07T20:31:41.2390026Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:41.2390367Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:41.2390689Z             )
2025-05-07T20:31:41.2390896Z         else:
2025-05-07T20:31:41.2391310Z             scale_ub_tensor = None
2025-05-07T20:31:41.2391573Z 
2025-05-07T20:31:41.2391816Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2392136Z             op = silu_mul_quant
2025-05-07T20:31:41.2392406Z             if compiled:
2025-05-07T20:31:41.2392667Z                 op = torch.compile(op)
2025-05-07T20:31:41.2392972Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:41.2393266Z 
2025-05-07T20:31:41.2393471Z         y_fp8, y_scale = fn()
2025-05-07T20:31:41.2393766Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:41.2394077Z 
2025-05-07T20:31:41.2394328Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2394679Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:41.2394980Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:41.2395307Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:41.2395677Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2396003Z 
2025-05-07T20:31:41.2396220Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:41.2396419Z 
2025-05-07T20:31:41.2396531Z moe/activation_test.py:126: 
2025-05-07T20:31:41.2396836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2397185Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:41.2397523Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2398320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:41.2399072Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:41.2399640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:41.2400332Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:41.2401124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:41.2401921Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2402679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:41.2403431Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2404154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:41.2404808Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:41.2405418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:41.2405945Z     fn()
2025-05-07T20:31:41.2406461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:41.2407049Z     self.fn.run(
2025-05-07T20:31:41.2407527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:41.2408060Z     kernel = self.compile(
2025-05-07T20:31:41.2408611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:41.2409271Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:41.2409676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2409930Z 
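The frames above show ref_fn entering fbgemm_gpu's triton_quantize_fp8_row, whose autotuned Triton kernel is what ultimately rejects fp8e4nv. As rough orientation only, rowwise FP8 quantization with the dequant convention this test uses (y_fp8.to(torch.float32) * y_scale[:, None]) can be emulated in plain PyTorch along the lines below; the helper name, the 448.0 E4M3 max, and the clamping details are our assumptions, not FBGEMM's kernel:

    from typing import Optional, Tuple
    import torch

    def rowwise_quant_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax, kept away from zero and optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1).to(torch.float32).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max / 448.0  # 448 is the largest normal float8_e4m3fn value
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]

The cast to torch.float8_e4m3fn works on any device in PyTorch itself; it is only the Triton kernel's on-GPU use of fp8e4nv that this architecture refuses.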
2025-05-07T20:31:41.2417988Z self = <...>
2025-05-07T20:31:41.2419149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:41.2422845Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f2fd90d25e0>}
2025-05-07T20:31:41.2424219Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:41.2425256Z context = <...>
2025-05-07T20:31:41.2425559Z 
2025-05-07T20:31:41.2425737Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:41.2426273Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:41.2426762Z                            module_map=module_map)
2025-05-07T20:31:41.2427137Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2427512Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:41.2427812Z E       ^
2025-05-07T20:31:41.2428284Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2428743Z 
2025-05-07T20:31:41.2429162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2429684Z 
2025-05-07T20:31:41.2429792Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:41.2430219Z     self=<...>,
2025-05-07T20:31:41.2430625Z     T=4096,
2025-05-07T20:31:41.2430831Z     D=5120,
2025-05-07T20:31:41.2431036Z     scale_ub=None,
2025-05-07T20:31:41.2431259Z     contiguous=True,
2025-05-07T20:31:41.2431494Z     compiled=True,
2025-05-07T20:31:41.2431716Z )
[The identify_mutated_tensors warning and traceback shown above repeat verbatim four more times here, tagged [1/7] (20:31:41.766130, 20:31:41.958533, 20:31:42.470564, 20:31:42.510011).]
2025-05-07T20:31:42.9677522Z self = <...>
2025-05-07T20:31:42.9678320Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The @given decorators, test source listing, Triton traceback, and CompilationError are identical to the T=128 failure above.]
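Every failing example so far dies in the same place: Triton refuses to lower fp8e4nv (FP8 E4M3) on this GPU. The g5.4xlarge runner's A10G is compute capability (8, 6), and Triton's message says only fp8e4b15 and fp8e5 are available there; E4M3 support is generally assumed to start at (8, 9) (Ada) or (9, 0) (Hopper). A minimal guard a test suite could check first, sketched under those assumptions (the helper name is ours):

    import torch

    def supports_fp8e4nv() -> bool:
        # torch.float8_e4m3fn is what lowers to Triton's fp8e4nv dtype; NVIDIA
        # hardware support is assumed to begin at compute capability (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

moe/activation_test.py could then skip via unittest.skipIf(not supports_fp8e4nv(), ...) instead of failing inside the kernel compile on every hypothesis example.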
2025-05-07T20:31:42.9717549Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.9717978Z     self=<...>,
2025-05-07T20:31:42.9718387Z     T=16384,
2025-05-07T20:31:42.9718595Z     D=5120,
2025-05-07T20:31:42.9718804Z     scale_ub=None,
2025-05-07T20:31:42.9719024Z     contiguous=True,
2025-05-07T20:31:42.9719259Z     compiled=True,
2025-05-07T20:31:42.9719480Z )
2025-05-07T20:31:43.0171488Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:43.0173764Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:43.0176436Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:43.0178409Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:43.0180617Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
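The recompile-limit warning is distinct from the FP8 failures: each hypothesis example changes T or the contiguity of x0/x1, the compiled silu_mul_quant guards on strides (the "last reason" line shows a stride mismatch, expected 5120 vs. actual 10240, i.e. a contiguous copy vs. a view into the [T, 2*D] buffer), and after 8 recompiles dynamo falls back to eager. Two possible knobs, sketched on the assumption that this torch build exposes the recompile_limit config it names in the warning:

    import torch
    import torch._dynamo

    # Raise the guard-failure recompile budget (config.recompile_limit above).
    torch._dynamo.config.recompile_limit = 32

    # Or compile once with dynamic shapes so differing sizes and strides can
    # share a single graph instead of forcing a recompile per layout:
    # compiled_op = torch.compile(silu_mul_quant, dynamic=True)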
2025-05-07T20:31:43.1391978Z self = <...>
2025-05-07T20:31:43.1392799Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The @given decorators, test source listing, Triton traceback, and CompilationError are identical to the T=128 failure above.]
2025-05-07T20:31:43.1432625Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:43.1433048Z     self=<...>,
2025-05-07T20:31:43.1433459Z     T=1,
2025-05-07T20:31:43.1433647Z     D=5120,
2025-05-07T20:31:43.1433851Z     scale_ub=1200.0,
2025-05-07T20:31:43.1434083Z     contiguous=True,
2025-05-07T20:31:43.1434309Z     compiled=True,
2025-05-07T20:31:43.1434525Z )
2025-05-07T20:31:43.5278667Z self = <...>
2025-05-07T20:31:43.5279422Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[The @given decorators and test body are identical to the T=128 failure above; this example fails one step earlier, inside fn() itself:]
2025-05-07T20:31:43.5300633Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:43.5300811Z 
2025-05-07T20:31:43.5300919Z moe/activation_test.py:117: 
2025-05-07T20:31:43.5301297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:43.5301639Z moe/activation_test.py:115: in fn
2025-05-07T20:31:43.5301946Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:43.5302526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:43.5303283Z     return fn(*args, **kwargs)
2025-05-07T20:31:43.5303964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5304663Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5305209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5305898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5306566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5307118Z kernel = self.compile( 2025-05-07T20:31:43.5307668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5308333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5308755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5308996Z 2025-05-07T20:31:43.5309215Z self = 2025-05-07T20:31:43.5310303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5311673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd82f5ca0>} 2025-05-07T20:31:43.5313018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5314038Z context = 2025-05-07T20:31:43.5314424Z 2025-05-07T20:31:43.5314605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5315128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5315608Z module_map=module_map) 2025-05-07T20:31:43.5315992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5316358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5316622Z E ^ 2025-05-07T20:31:43.5317096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5317553Z 2025-05-07T20:31:43.5317976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5318487Z 2025-05-07T20:31:43.5318593Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5319020Z self=, 2025-05-07T20:31:43.5319435Z T=1, 2025-05-07T20:31:43.5319628Z D=5120, 2025-05-07T20:31:43.5319821Z scale_ub=None, 2025-05-07T20:31:43.5320048Z contiguous=False, 2025-05-07T20:31:43.5320283Z compiled=True, 2025-05-07T20:31:43.5320493Z ) 2025-05-07T20:31:43.6133429Z self = 2025-05-07T20:31:43.6134893Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.6135644Z 2025-05-07T20:31:43.6135875Z @given( 2025-05-07T20:31:43.6136346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6136995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6137632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6138300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6138980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6139567Z ) 2025-05-07T20:31:43.6141047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6142051Z def test_silu_mul_quant( 2025-05-07T20:31:43.6142508Z self, 2025-05-07T20:31:43.6142718Z T: int, 2025-05-07T20:31:43.6142928Z D: int, 2025-05-07T20:31:43.6143162Z scale_ub: Optional[float], 2025-05-07T20:31:43.6143450Z contiguous: bool, 2025-05-07T20:31:43.6143697Z compiled: bool, 2025-05-07T20:31:43.6143936Z ) -> None: 2025-05-07T20:31:43.6144167Z torch.manual_seed(2025) 2025-05-07T20:31:43.6144418Z 2025-05-07T20:31:43.6144707Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6145069Z 2025-05-07T20:31:43.6145267Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6145576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6145904Z x = x_sign * x_clamp 2025-05-07T20:31:43.6146153Z x0 = x[:, :D] 2025-05-07T20:31:43.6146386Z x1 = x[:, D:] 2025-05-07T20:31:43.6146617Z 2025-05-07T20:31:43.6146817Z if contiguous: 2025-05-07T20:31:43.6147058Z x0 = x0.contiguous() 2025-05-07T20:31:43.6147334Z x1 = x1.contiguous() 2025-05-07T20:31:43.6147590Z 2025-05-07T20:31:43.6147786Z if scale_ub is not None: 2025-05-07T20:31:43.6148070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6148419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6148742Z ) 2025-05-07T20:31:43.6148948Z else: 2025-05-07T20:31:43.6149170Z scale_ub_tensor = None 2025-05-07T20:31:43.6149427Z 2025-05-07T20:31:43.6149672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6149998Z op = silu_mul_quant 2025-05-07T20:31:43.6150255Z if compiled: 2025-05-07T20:31:43.6150520Z op = torch.compile(op) 2025-05-07T20:31:43.6150833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6151268Z 2025-05-07T20:31:43.6151480Z y_fp8, y_scale = fn() 2025-05-07T20:31:43.6151780Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:43.6152121Z 2025-05-07T20:31:43.6152385Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6152730Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:43.6153043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:43.6153368Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:43.6153745Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.6154069Z 2025-05-07T20:31:43.6154277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:43.6154485Z 2025-05-07T20:31:43.6154592Z moe/activation_test.py:126: 2025-05-07T20:31:43.6154909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6155260Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:43.6155600Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.6156592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:43.6157373Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:43.6157933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6158621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6159319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:43.6160051Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.6160811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:43.6161644Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.6162392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:43.6163040Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:43.6163645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:43.6164187Z fn() 2025-05-07T20:31:43.6164710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:43.6165292Z self.fn.run( 2025-05-07T20:31:43.6165759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6166303Z kernel = self.compile( 2025-05-07T20:31:43.6166852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6167509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6167927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6168168Z 2025-05-07T20:31:43.6168381Z self = 2025-05-07T20:31:43.6169466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6170850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f2fd838f280>} 2025-05-07T20:31:43.6172185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6173320Z context = 2025-05-07T20:31:43.6173620Z 2025-05-07T20:31:43.6173792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6174327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6174795Z module_map=module_map) 2025-05-07T20:31:43.6175174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6175542Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:43.6175812Z E ^ 2025-05-07T20:31:43.6176281Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6176740Z 2025-05-07T20:31:43.6177163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6177675Z 2025-05-07T20:31:43.6177788Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6178216Z self=, 2025-05-07T20:31:43.6178626Z T=1, 2025-05-07T20:31:43.6178821Z D=5120, 2025-05-07T20:31:43.6179016Z scale_ub=None, 2025-05-07T20:31:43.6179240Z contiguous=True, 2025-05-07T20:31:43.6179481Z compiled=False, 2025-05-07T20:31:43.6179690Z ) 2025-05-07T20:31:43.8158882Z self = 2025-05-07T20:31:43.8159738Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:43.8160123Z 2025-05-07T20:31:43.8160238Z @given( 2025-05-07T20:31:43.8160516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8160837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8161155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8161499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8162114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8162442Z ) 2025-05-07T20:31:43.8162805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8163264Z def test_silu_mul_quant( 2025-05-07T20:31:43.8163511Z self, 2025-05-07T20:31:43.8163713Z T: int, 2025-05-07T20:31:43.8163918Z D: int, 2025-05-07T20:31:43.8164140Z scale_ub: Optional[float], 2025-05-07T20:31:43.8164422Z contiguous: bool, 2025-05-07T20:31:43.8164678Z compiled: bool, 2025-05-07T20:31:43.8164910Z ) -> None: 2025-05-07T20:31:43.8165145Z torch.manual_seed(2025) 2025-05-07T20:31:43.8165396Z 2025-05-07T20:31:43.8165672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8166028Z 2025-05-07T20:31:43.8166232Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8166531Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8166857Z x = x_sign * x_clamp 2025-05-07T20:31:43.8167122Z x0 = x[:, :D] 2025-05-07T20:31:43.8167347Z x1 = x[:, D:] 2025-05-07T20:31:43.8167569Z 2025-05-07T20:31:43.8167769Z if contiguous: 2025-05-07T20:31:43.8168017Z x0 = x0.contiguous() 2025-05-07T20:31:43.8168284Z x1 = x1.contiguous() 2025-05-07T20:31:43.8168541Z 2025-05-07T20:31:43.8168744Z if scale_ub is not None: 2025-05-07T20:31:43.8169023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8169367Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8169690Z ) 2025-05-07T20:31:43.8169888Z else: 2025-05-07T20:31:43.8170109Z scale_ub_tensor = None 2025-05-07T20:31:43.8170371Z 2025-05-07T20:31:43.8170610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8170939Z op = silu_mul_quant 2025-05-07T20:31:43.8171202Z if compiled: 2025-05-07T20:31:43.8171454Z op 
= torch.compile(op) 2025-05-07T20:31:43.8171963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8172247Z 2025-05-07T20:31:43.8172444Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8172622Z 2025-05-07T20:31:43.8172727Z moe/activation_test.py:117: 2025-05-07T20:31:43.8173039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8173383Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8173669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8174366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8175064Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8175601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8176295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8176971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8177518Z kernel = self.compile( 2025-05-07T20:31:43.8178058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8178722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8179136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8179373Z 2025-05-07T20:31:43.8179590Z self = 2025-05-07T20:31:43.8180666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8182318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7d56040>} 2025-05-07T20:31:43.8183671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8184693Z context = 2025-05-07T20:31:43.8184984Z 2025-05-07T20:31:43.8185156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8185691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8186164Z module_map=module_map) 2025-05-07T20:31:43.8186548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8186906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8187181Z E ^ 2025-05-07T20:31:43.8187658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8188113Z 2025-05-07T20:31:43.8188537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8189059Z 2025-05-07T20:31:43.8189167Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8189590Z self=, 2025-05-07T20:31:43.8190000Z T=128, 2025-05-07T20:31:43.8190192Z D=5120, 2025-05-07T20:31:43.8190396Z scale_ub=None, 2025-05-07T20:31:43.8190623Z contiguous=False, 2025-05-07T20:31:43.8190855Z compiled=True, 2025-05-07T20:31:43.8191073Z ) 2025-05-07T20:31:43.8191404Z self = 2025-05-07T20:31:43.8191899Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.8192179Z 2025-05-07T20:31:43.8192259Z @given( 2025-05-07T20:31:43.8192588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8192916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8193226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8193567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8193906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8194194Z ) 2025-05-07T20:31:43.8194551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8195004Z def test_silu_mul_quant( 2025-05-07T20:31:43.8195247Z self, 2025-05-07T20:31:43.8195456Z T: int, 2025-05-07T20:31:43.8195665Z D: int, 2025-05-07T20:31:43.8195885Z scale_ub: Optional[float], 2025-05-07T20:31:43.8196168Z contiguous: bool, 2025-05-07T20:31:43.8196415Z compiled: bool, 2025-05-07T20:31:43.8196641Z ) -> None: 2025-05-07T20:31:43.8196866Z torch.manual_seed(2025) 2025-05-07T20:31:43.8197117Z 2025-05-07T20:31:43.8197409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8197759Z 2025-05-07T20:31:43.8197964Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8198265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8198579Z x = x_sign * x_clamp 2025-05-07T20:31:43.8198830Z x0 = x[:, :D] 2025-05-07T20:31:43.8199057Z x1 = x[:, D:] 2025-05-07T20:31:43.8199269Z 2025-05-07T20:31:43.8199469Z if contiguous: 2025-05-07T20:31:43.8199713Z x0 = x0.contiguous() 2025-05-07T20:31:43.8199978Z x1 = x1.contiguous() 2025-05-07T20:31:43.8200234Z 2025-05-07T20:31:43.8200438Z if scale_ub is not None: 2025-05-07T20:31:43.8200716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8201064Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8201387Z ) 2025-05-07T20:31:43.8201583Z else: 2025-05-07T20:31:43.8201886Z scale_ub_tensor = None 2025-05-07T20:31:43.8202158Z 2025-05-07T20:31:43.8202401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8202719Z op = silu_mul_quant 2025-05-07T20:31:43.8202983Z if compiled: 2025-05-07T20:31:43.8203242Z op = torch.compile(op) 2025-05-07T20:31:43.8203550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8203839Z 2025-05-07T20:31:43.8204042Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8204210Z 2025-05-07T20:31:43.8204320Z moe/activation_test.py:117: 2025-05-07T20:31:43.8204625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8204966Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8205250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8205819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8206389Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8207060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8207747Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8208289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8208974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8209656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8210190Z kernel = self.compile( 2025-05-07T20:31:43.8210739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8211404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8211804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8212139Z 2025-05-07T20:31:43.8212351Z self = 2025-05-07T20:31:43.8213432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8214804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd912ac10>} 2025-05-07T20:31:43.8216152Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8217177Z context = 2025-05-07T20:31:43.8217470Z 2025-05-07T20:31:43.8217644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8218182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8218663Z module_map=module_map) 2025-05-07T20:31:43.8219044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8219401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8219674Z E ^ 2025-05-07T20:31:43.8220141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8220589Z 2025-05-07T20:31:43.8221008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8221606Z 2025-05-07T20:31:43.8221713Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8222140Z self=, 2025-05-07T20:31:43.8222557Z T=128, 2025-05-07T20:31:43.8222832Z D=7168, 2025-05-07T20:31:43.8223038Z scale_ub=1200.0, 2025-05-07T20:31:43.8223275Z contiguous=False, 2025-05-07T20:31:43.8223504Z compiled=False, 2025-05-07T20:31:43.8223722Z ) 2025-05-07T20:31:43.9761503Z self = 2025-05-07T20:31:43.9762317Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.9762954Z 2025-05-07T20:31:43.9763176Z @given( 2025-05-07T20:31:43.9763828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9764520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9765157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9765837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9766497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9767080Z ) 2025-05-07T20:31:43.9767822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9768722Z def test_silu_mul_quant( 2025-05-07T20:31:43.9769223Z self, 2025-05-07T20:31:43.9769626Z T: int, 2025-05-07T20:31:43.9770027Z D: int, 2025-05-07T20:31:43.9770479Z scale_ub: Optional[float], 2025-05-07T20:31:43.9771036Z contiguous: bool, 2025-05-07T20:31:43.9771516Z compiled: bool, 2025-05-07T20:31:43.9771980Z ) -> None: 2025-05-07T20:31:43.9772284Z torch.manual_seed(2025) 2025-05-07T20:31:43.9772539Z 2025-05-07T20:31:43.9772819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9773173Z 2025-05-07T20:31:43.9773378Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9773676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9774001Z x = x_sign * x_clamp 2025-05-07T20:31:43.9774255Z x0 = x[:, :D] 2025-05-07T20:31:43.9774483Z x1 = x[:, D:] 2025-05-07T20:31:43.9774706Z 2025-05-07T20:31:43.9775742Z if contiguous: 2025-05-07T20:31:43.9775983Z x0 = x0.contiguous() 2025-05-07T20:31:43.9776259Z x1 = x1.contiguous() 2025-05-07T20:31:43.9776513Z 2025-05-07T20:31:43.9776712Z if scale_ub is not None: 2025-05-07T20:31:43.9777000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9777356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9777671Z ) 2025-05-07T20:31:43.9777880Z else: 2025-05-07T20:31:43.9778104Z scale_ub_tensor = None 2025-05-07T20:31:43.9778369Z 2025-05-07T20:31:43.9778610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9778940Z op = silu_mul_quant 2025-05-07T20:31:43.9779211Z if compiled: 2025-05-07T20:31:43.9779470Z op = torch.compile(op) 2025-05-07T20:31:43.9779778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9780068Z 2025-05-07T20:31:43.9780269Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9780455Z 2025-05-07T20:31:43.9780563Z moe/activation_test.py:117: 2025-05-07T20:31:43.9780872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9781346Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9781638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9782339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9783086Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9783631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9784323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9784997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9785534Z kernel = self.compile( 2025-05-07T20:31:43.9795062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9795766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9796181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9796430Z 2025-05-07T20:31:43.9796649Z self = 2025-05-07T20:31:43.9797740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9799148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd8467dc0>} 2025-05-07T20:31:43.9800506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9801550Z context = 2025-05-07T20:31:43.9801854Z 2025-05-07T20:31:43.9802030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9802570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9803055Z module_map=module_map) 2025-05-07T20:31:43.9803443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9803814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9804091Z E ^ 2025-05-07T20:31:43.9804567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9805029Z 2025-05-07T20:31:43.9805462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9806131Z 2025-05-07T20:31:43.9806255Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9806686Z self=, 2025-05-07T20:31:43.9807093Z T=128, 2025-05-07T20:31:43.9807298Z D=5120, 2025-05-07T20:31:43.9807508Z scale_ub=None, 2025-05-07T20:31:43.9807731Z contiguous=False, 2025-05-07T20:31:43.9807973Z compiled=False, 2025-05-07T20:31:43.9808196Z ) 2025-05-07T20:31:43.9808520Z self = 2025-05-07T20:31:43.9809030Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:43.9809310Z 2025-05-07T20:31:43.9809402Z @given( 2025-05-07T20:31:43.9809645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9809981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9810327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9810683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9811026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9811329Z ) 2025-05-07T20:31:43.9811694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9812151Z def test_silu_mul_quant( 2025-05-07T20:31:43.9812415Z self, 2025-05-07T20:31:43.9812629Z T: int, 2025-05-07T20:31:43.9812858Z D: int, 2025-05-07T20:31:43.9813117Z scale_ub: Optional[float], 2025-05-07T20:31:43.9813409Z contiguous: bool, 2025-05-07T20:31:43.9813658Z compiled: bool, 2025-05-07T20:31:43.9813903Z ) -> None: 2025-05-07T20:31:43.9814135Z torch.manual_seed(2025) 2025-05-07T20:31:43.9814388Z 2025-05-07T20:31:43.9814675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9815034Z 2025-05-07T20:31:43.9815328Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9815642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9815972Z x = x_sign * x_clamp 2025-05-07T20:31:43.9816234Z x0 = x[:, :D] 2025-05-07T20:31:43.9816458Z x1 = x[:, D:] 2025-05-07T20:31:43.9816673Z 2025-05-07T20:31:43.9816862Z if contiguous: 2025-05-07T20:31:43.9817103Z x0 = x0.contiguous() 2025-05-07T20:31:43.9817376Z x1 = x1.contiguous() 2025-05-07T20:31:43.9817636Z 2025-05-07T20:31:43.9817839Z if scale_ub is not None: 2025-05-07T20:31:43.9818136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9818491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9818811Z ) 2025-05-07T20:31:43.9819026Z else: 2025-05-07T20:31:43.9819257Z scale_ub_tensor = None 2025-05-07T20:31:43.9819519Z 2025-05-07T20:31:43.9819765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9820114Z op = silu_mul_quant 2025-05-07T20:31:43.9820381Z if compiled: 2025-05-07T20:31:43.9820652Z op = torch.compile(op) 2025-05-07T20:31:43.9820967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9821329Z 2025-05-07T20:31:43.9821531Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9821715Z 2025-05-07T20:31:43.9821821Z moe/activation_test.py:117: 2025-05-07T20:31:43.9822139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9822482Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9822809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9823542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9824238Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9824799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9825583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9826264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9826817Z kernel = self.compile( 2025-05-07T20:31:43.9827376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9828067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9828482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9828717Z 2025-05-07T20:31:43.9828929Z self = 2025-05-07T20:31:43.9830020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9831411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7d7b3a0>} 2025-05-07T20:31:43.9832759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9833777Z context = 2025-05-07T20:31:43.9834077Z 2025-05-07T20:31:43.9834249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9834783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9835266Z module_map=module_map) 2025-05-07T20:31:43.9835642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9836092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9836369Z E ^ 2025-05-07T20:31:43.9836834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9837289Z 2025-05-07T20:31:43.9837722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9838257Z 2025-05-07T20:31:43.9838364Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9838796Z self=, 2025-05-07T20:31:43.9839203Z T=128, 2025-05-07T20:31:43.9839410Z D=5120, 2025-05-07T20:31:43.9839620Z scale_ub=1200.0, 2025-05-07T20:31:43.9839862Z contiguous=True, 2025-05-07T20:31:43.9840360Z compiled=False, 2025-05-07T20:31:43.9840678Z ) 2025-05-07T20:31:44.2109247Z self = 2025-05-07T20:31:44.2110753Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.2111486Z 2025-05-07T20:31:44.2111657Z @given( 2025-05-07T20:31:44.2112131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.2112513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.2112852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.2113197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.2113539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.2113836Z ) 2025-05-07T20:31:44.2114192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.2114848Z def test_silu_mul_quant( 2025-05-07T20:31:44.2115104Z self, 2025-05-07T20:31:44.2115313Z T: int, 2025-05-07T20:31:44.2115517Z D: int, 2025-05-07T20:31:44.2115747Z scale_ub: Optional[float], 2025-05-07T20:31:44.2116034Z contiguous: bool, 2025-05-07T20:31:44.2116651Z compiled: bool, 2025-05-07T20:31:44.2116923Z ) -> None: 2025-05-07T20:31:44.2117156Z torch.manual_seed(2025) 2025-05-07T20:31:44.2117414Z 2025-05-07T20:31:44.2117695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.2118055Z 2025-05-07T20:31:44.2118260Z x_sign = torch.sign(x) 2025-05-07T20:31:44.2118561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.2118886Z x = x_sign * x_clamp 2025-05-07T20:31:44.2119140Z x0 = x[:, :D] 2025-05-07T20:31:44.2119371Z x1 = x[:, D:] 2025-05-07T20:31:44.2119584Z 2025-05-07T20:31:44.2119782Z if contiguous: 2025-05-07T20:31:44.2120030Z x0 = x0.contiguous() 2025-05-07T20:31:44.2120298Z x1 = x1.contiguous() 2025-05-07T20:31:44.2120552Z 2025-05-07T20:31:44.2120755Z if scale_ub is not None: 2025-05-07T20:31:44.2121038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.2121405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.2121730Z ) 2025-05-07T20:31:44.2121931Z else: 2025-05-07T20:31:44.2122155Z scale_ub_tensor = None 2025-05-07T20:31:44.2122421Z 2025-05-07T20:31:44.2122660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.2122989Z op = silu_mul_quant 2025-05-07T20:31:44.2123254Z if compiled: 2025-05-07T20:31:44.2123515Z op = torch.compile(op) 2025-05-07T20:31:44.2123820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2124116Z 2025-05-07T20:31:44.2124325Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.2124500Z 2025-05-07T20:31:44.2124607Z moe/activation_test.py:117: 2025-05-07T20:31:44.2124916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2125266Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.2125560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2126463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.2127182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.2127738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.2128426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.2129099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.2129644Z kernel = self.compile( 2025-05-07T20:31:44.2130191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.2130861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.2131270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2131518Z 2025-05-07T20:31:44.2131736Z self = 2025-05-07T20:31:44.2132813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.2134198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd8303c10>} 2025-05-07T20:31:44.2135543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.2136565Z context = 2025-05-07T20:31:44.2136856Z 2025-05-07T20:31:44.2137040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.2137659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.2138135Z module_map=module_map) 2025-05-07T20:31:44.2138510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.2138868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.2139141Z E ^ 2025-05-07T20:31:44.2139607Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.2140351Z 2025-05-07T20:31:44.2140788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.2141406Z 2025-05-07T20:31:44.2141513Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.2141938Z self=, 2025-05-07T20:31:44.2142361Z T=1, 2025-05-07T20:31:44.2142555Z D=7168, 2025-05-07T20:31:44.2142759Z scale_ub=1200.0, 2025-05-07T20:31:44.2142994Z contiguous=True, 2025-05-07T20:31:44.2143222Z compiled=True, 2025-05-07T20:31:44.2143443Z ) 2025-05-07T20:31:44.2143772Z self = 2025-05-07T20:31:44.2144271Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.2144534Z 2025-05-07T20:31:44.2144616Z @given( 2025-05-07T20:31:44.2144855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.2145182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.2145492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.2145840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.2146180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.2146469Z ) 2025-05-07T20:31:44.2146833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.2147425Z def test_silu_mul_quant( 2025-05-07T20:31:44.2147683Z self, 2025-05-07T20:31:44.2147884Z T: int, 2025-05-07T20:31:44.2148092Z D: int, 2025-05-07T20:31:44.2148323Z scale_ub: Optional[float], 2025-05-07T20:31:44.2148603Z contiguous: bool, 2025-05-07T20:31:44.2148854Z compiled: bool, 2025-05-07T20:31:44.2149085Z ) -> None: 2025-05-07T20:31:44.2149304Z torch.manual_seed(2025) 2025-05-07T20:31:44.2149558Z 2025-05-07T20:31:44.2149841Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.2150190Z 2025-05-07T20:31:44.2150392Z x_sign = torch.sign(x) 2025-05-07T20:31:44.2150691Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.2151005Z x = x_sign * x_clamp 2025-05-07T20:31:44.2151255Z x0 = x[:, :D] 2025-05-07T20:31:44.2151480Z x1 = x[:, D:] 2025-05-07T20:31:44.2151690Z 2025-05-07T20:31:44.2151897Z if contiguous: 2025-05-07T20:31:44.2152137Z x0 = x0.contiguous() 2025-05-07T20:31:44.2152399Z x1 = x1.contiguous() 2025-05-07T20:31:44.2152652Z 2025-05-07T20:31:44.2152852Z if scale_ub is not None: 2025-05-07T20:31:44.2153138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.2153479Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.2153795Z ) 2025-05-07T20:31:44.2153998Z else: 2025-05-07T20:31:44.2154211Z scale_ub_tensor = None 2025-05-07T20:31:44.2154473Z 2025-05-07T20:31:44.2154717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.2155037Z op = silu_mul_quant 2025-05-07T20:31:44.2155299Z if compiled: 2025-05-07T20:31:44.2155556Z op = torch.compile(op) 2025-05-07T20:31:44.2155856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2156140Z 2025-05-07T20:31:44.2156345Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.2156658Z 2025-05-07T20:31:44.2156761Z moe/activation_test.py:117: 2025-05-07T20:31:44.2157064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2157408Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.2157703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2158266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.2158843Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.2159512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.2160202Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.2160755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.2161443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.2162123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.2162658Z kernel = self.compile( 2025-05-07T20:31:44.2163211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.2163875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.2164283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2164520Z 2025-05-07T20:31:44.2164731Z self = 2025-05-07T20:31:44.2165830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.2167277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd83601f0>} 2025-05-07T20:31:44.2168626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.2169641Z context = 2025-05-07T20:31:44.2169943Z 2025-05-07T20:31:44.2170113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.2170652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.2171128Z module_map=module_map) 2025-05-07T20:31:44.2171498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.2171865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.2172135Z E ^ 2025-05-07T20:31:44.2172606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.2173122Z 2025-05-07T20:31:44.2173547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.2174068Z 2025-05-07T20:31:44.2174175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.2174600Z self=, 2025-05-07T20:31:44.2175003Z T=1, 2025-05-07T20:31:44.2175194Z D=7168, 2025-05-07T20:31:44.2175400Z scale_ub=1200.0, 2025-05-07T20:31:44.2175630Z contiguous=False, 2025-05-07T20:31:44.2175864Z compiled=True, 2025-05-07T20:31:44.2176076Z ) 2025-05-07T20:31:44.5953707Z self = 2025-05-07T20:31:44.5954461Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.5954846Z 2025-05-07T20:31:44.5955421Z @given( 2025-05-07T20:31:44.5955748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5956167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5956520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5956855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5957197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5957493Z ) 2025-05-07T20:31:44.5957850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5958292Z def test_silu_mul_quant( 2025-05-07T20:31:44.5958541Z self, 2025-05-07T20:31:44.5958747Z T: int, 2025-05-07T20:31:44.5958946Z D: int, 2025-05-07T20:31:44.5959176Z scale_ub: Optional[float], 2025-05-07T20:31:44.5959456Z contiguous: bool, 2025-05-07T20:31:44.5959697Z compiled: bool, 2025-05-07T20:31:44.5959934Z ) -> None: 2025-05-07T20:31:44.5960161Z torch.manual_seed(2025) 2025-05-07T20:31:44.5960425Z 2025-05-07T20:31:44.5960702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5961061Z 2025-05-07T20:31:44.5961265Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5961564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5961877Z x = x_sign * x_clamp 2025-05-07T20:31:44.5962128Z x0 = x[:, :D] 2025-05-07T20:31:44.5962353Z x1 = x[:, D:] 2025-05-07T20:31:44.5962564Z 2025-05-07T20:31:44.5962789Z if contiguous: 2025-05-07T20:31:44.5963053Z x0 = x0.contiguous() 2025-05-07T20:31:44.5963315Z x1 = x1.contiguous() 2025-05-07T20:31:44.5963566Z 2025-05-07T20:31:44.5963767Z if scale_ub is not None: 2025-05-07T20:31:44.5964044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5964389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5964708Z ) 2025-05-07T20:31:44.5964902Z else: 2025-05-07T20:31:44.5965281Z scale_ub_tensor = None 2025-05-07T20:31:44.5965543Z 2025-05-07T20:31:44.5965776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5966101Z op = silu_mul_quant 2025-05-07T20:31:44.5966360Z if compiled: 2025-05-07T20:31:44.5966610Z op = torch.compile(op) 2025-05-07T20:31:44.5966915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5967200Z 2025-05-07T20:31:44.5967400Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5967568Z 2025-05-07T20:31:44.5967672Z moe/activation_test.py:117: 2025-05-07T20:31:44.5967973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5968312Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5968596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5969164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.5969745Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2fd8f718b0>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.cuda.libdevice'>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2fd7721430>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.cuda.libdevice'>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
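Note on the failure: both the FBGEMM kernel (_fbgemm_silu_mul_quant) and the reference quantization kernel (_kernel_quantize_fp8_row) die at the same point, when Triton tries to lower the fp8e4nv element type (torch.float8_e4m3fn). fp8e4nv requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); the error listing only ('fp8e4b15', 'fp8e5') as supported indicates an older part, consistent with an A10G (compute capability 8.6) on an AWS g5 instance. A minimal sketch of the kind of guard that would skip these cases on unsupported hardware follows; the helper name is hypothetical and not part of activation_test.py:

    import unittest
    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton lowers torch.float8_e4m3fn to 'fp8e4nv',
        # which needs compute capability >= 8.9 (Ada / Hopper). An A10G
        # reports (8, 6), so this returns False on that hardware.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test class:
    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...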
Hypothesis then tried further examples, and every one raised the identical CompilationError from the same _fbgemm_silu_mul_quant compile (the per-example test source and tracebacks match the first example above):

Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Each fails at moe/activation_test.py:117 in "y_fp8, y_scale = fn()"; when compiled=True the call additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 before reaching activation.py:80, but the Triton error is unchanged (a dtype-fallback sketch follows below).
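That the same ValueError appears for every T, D, scale_ub, contiguous, and compiled combination confirms this is purely an architecture/dtype mismatch, not a shape or autotuning issue. Where running on pre-SM-8.9 GPUs is a requirement, one workaround is to fall back to an fp8 variant the backend does support; a sketch under that assumption (the helper name is hypothetical, not an FBGEMM API):

    import torch

    def _pick_fp8_dtype() -> torch.dtype:
        # Hypothetical fallback: torch.float8_e4m3fn maps to Triton's
        # 'fp8e4nv' (SM 8.9+ only), while torch.float8_e5m2 maps to 'fp8e5',
        # which the error message lists as supported on this GPU.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Note the trade-off: e5m2 gives more exponent range but less mantissa precision than e4m3, so test tolerances would likely need loosening.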
2025-05-07T20:31:45.4404229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4404918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4405456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4406139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4406808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4407346Z kernel = self.compile( 2025-05-07T20:31:45.4407892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4408556Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4408961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4409194Z 2025-05-07T20:31:45.4409404Z self = 2025-05-07T20:31:45.4410482Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4411856Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd764b790>} 2025-05-07T20:31:45.4413272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4414314Z context = 2025-05-07T20:31:45.4414603Z 2025-05-07T20:31:45.4414774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4415304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4415774Z module_map=module_map) 2025-05-07T20:31:45.4416148Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4416501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4416770Z E ^ 2025-05-07T20:31:45.4417242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4417690Z 2025-05-07T20:31:45.4418109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4418635Z 2025-05-07T20:31:45.4418743Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4419166Z self=, 2025-05-07T20:31:45.4419571Z T=4096, 2025-05-07T20:31:45.4419759Z D=5120, 2025-05-07T20:31:45.4419961Z scale_ub=None, 2025-05-07T20:31:45.4420183Z contiguous=False, 2025-05-07T20:31:45.4420410Z compiled=True, 2025-05-07T20:31:45.4420624Z ) 2025-05-07T20:31:45.4420952Z self = 2025-05-07T20:31:45.4421531Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4421815Z 2025-05-07T20:31:45.4421901Z @given( 2025-05-07T20:31:45.4422133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4422460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4422775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4423211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4423550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4423843Z ) 2025-05-07T20:31:45.4424199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4424651Z def test_silu_mul_quant( 2025-05-07T20:31:45.4424900Z self, 2025-05-07T20:31:45.4425096Z T: int, 2025-05-07T20:31:45.4425305Z D: int, 2025-05-07T20:31:45.4425533Z scale_ub: Optional[float], 2025-05-07T20:31:45.4425820Z contiguous: bool, 2025-05-07T20:31:45.4426063Z compiled: bool, 2025-05-07T20:31:45.4426295Z ) -> None: 2025-05-07T20:31:45.4426523Z torch.manual_seed(2025) 2025-05-07T20:31:45.4426769Z 2025-05-07T20:31:45.4427053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4427403Z 2025-05-07T20:31:45.4427598Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4427899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4428229Z x = x_sign * x_clamp 2025-05-07T20:31:45.4428473Z x0 = x[:, :D] 2025-05-07T20:31:45.4428696Z x1 = x[:, D:] 2025-05-07T20:31:45.4428913Z 2025-05-07T20:31:45.4429099Z if contiguous: 2025-05-07T20:31:45.4429338Z x0 = x0.contiguous() 2025-05-07T20:31:45.4429605Z x1 = x1.contiguous() 2025-05-07T20:31:45.4429848Z 2025-05-07T20:31:45.4430045Z if scale_ub is not None: 2025-05-07T20:31:45.4430328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4430666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4430983Z ) 2025-05-07T20:31:45.4431185Z else: 2025-05-07T20:31:45.4431404Z scale_ub_tensor = None 2025-05-07T20:31:45.4431657Z 2025-05-07T20:31:45.4431895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4432214Z op = silu_mul_quant 2025-05-07T20:31:45.4432558Z if compiled: 2025-05-07T20:31:45.4432860Z op = torch.compile(op) 2025-05-07T20:31:45.4433174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4433453Z 2025-05-07T20:31:45.4433654Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4433823Z 2025-05-07T20:31:45.4433933Z moe/activation_test.py:117: 2025-05-07T20:31:45.4434230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4434570Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4434863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4435429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4435988Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4436658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4437351Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4437904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4438603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4439273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4439810Z kernel = self.compile( 2025-05-07T20:31:45.4440648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4441323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4441731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4441966Z 2025-05-07T20:31:45.4442187Z self = 2025-05-07T20:31:45.4443276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4444785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72e0550>} 2025-05-07T20:31:45.4446125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4447143Z context = 2025-05-07T20:31:45.4447433Z 2025-05-07T20:31:45.4447605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4448135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4448608Z module_map=module_map) 2025-05-07T20:31:45.4448985Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4449339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4449605Z E ^ 2025-05-07T20:31:45.4450077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4450525Z 2025-05-07T20:31:45.4450941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4451468Z 2025-05-07T20:31:45.8512218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8512929Z self=, 2025-05-07T20:31:45.8513490Z T=4096, 2025-05-07T20:31:45.8513754Z D=5120, 2025-05-07T20:31:45.8514016Z scale_ub=1200.0, 2025-05-07T20:31:45.8514310Z contiguous=False, 2025-05-07T20:31:45.8514598Z compiled=False, 2025-05-07T20:31:45.8514821Z ) 2025-05-07T20:31:45.8515517Z self = 2025-05-07T20:31:45.8516050Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8516330Z 2025-05-07T20:31:45.8516423Z @given( 2025-05-07T20:31:45.8516662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8516994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8517317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8517666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8518009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8518313Z ) 2025-05-07T20:31:45.8518685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8519142Z def test_silu_mul_quant( 2025-05-07T20:31:45.8519399Z self, 2025-05-07T20:31:45.8519605Z T: int, 2025-05-07T20:31:45.8519809Z D: int, 2025-05-07T20:31:45.8520052Z scale_ub: Optional[float], 2025-05-07T20:31:45.8520335Z contiguous: bool, 2025-05-07T20:31:45.8520586Z compiled: bool, 2025-05-07T20:31:45.8520830Z ) -> None: 2025-05-07T20:31:45.8521062Z torch.manual_seed(2025) 2025-05-07T20:31:45.8521315Z 2025-05-07T20:31:45.8521601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8521964Z 2025-05-07T20:31:45.8522161Z x_sign = torch.sign(x) 2025-05-07T20:31:45.8522466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.8522796Z x = x_sign * x_clamp 2025-05-07T20:31:45.8523050Z x0 = x[:, :D] 2025-05-07T20:31:45.8523272Z x1 = x[:, D:] 2025-05-07T20:31:45.8523494Z 2025-05-07T20:31:45.8523691Z if contiguous: 2025-05-07T20:31:45.8523935Z x0 = x0.contiguous() 2025-05-07T20:31:45.8524207Z x1 = x1.contiguous() 2025-05-07T20:31:45.8524458Z 2025-05-07T20:31:45.8524648Z if scale_ub is not None: 2025-05-07T20:31:45.8525100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.8525447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.8525761Z ) 2025-05-07T20:31:45.8525965Z else: 2025-05-07T20:31:45.8526188Z scale_ub_tensor = None 2025-05-07T20:31:45.8526444Z 2025-05-07T20:31:45.8526689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.8527022Z op = silu_mul_quant 2025-05-07T20:31:45.8527279Z if compiled: 2025-05-07T20:31:45.8527540Z op = torch.compile(op) 2025-05-07T20:31:45.8527851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8528139Z 2025-05-07T20:31:45.8528337Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.8528516Z 2025-05-07T20:31:45.8528621Z moe/activation_test.py:117: 2025-05-07T20:31:45.8528936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8529287Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.8529587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8530298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.8530995Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.8531575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.8532412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.8533097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.8533634Z kernel = self.compile( 2025-05-07T20:31:45.8534189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.8534859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.8535408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8535647Z 2025-05-07T20:31:45.8535860Z self = 2025-05-07T20:31:45.8536948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.8538402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a40d0>} 2025-05-07T20:31:45.8539755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.8541075Z context = 2025-05-07T20:31:45.8541441Z 2025-05-07T20:31:45.8541619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.8542167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.8542645Z module_map=module_map) 2025-05-07T20:31:45.8543035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.8543396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.8543669Z E ^ 2025-05-07T20:31:45.8544147Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:45.8545670Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.8546241Z     self=,
2025-05-07T20:31:45.8546656Z     T=4096,
2025-05-07T20:31:45.8546851Z     D=5120,
2025-05-07T20:31:45.8547055Z     scale_ub=1200.0,
2025-05-07T20:31:45.8547291Z     contiguous=False,
2025-05-07T20:31:45.8547527Z     compiled=True,
2025-05-07T20:31:45.8547745Z )
2025-05-07T20:31:45.8548077Z self = 
2025-05-07T20:31:45.8548579Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:31:45.8548868Z 
2025-05-07T20:31:45.8548948Z     @given(
2025-05-07T20:31:45.8549188Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.8549504Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.8549826Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.8550168Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.8550513Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.8550816Z     )
2025-05-07T20:31:45.8551178Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.8551639Z     def test_silu_mul_quant(
2025-05-07T20:31:45.8551892Z         self,
2025-05-07T20:31:45.8552101Z         T: int,
2025-05-07T20:31:45.8552316Z         D: int,
2025-05-07T20:31:45.8552544Z         scale_ub: Optional[float],
2025-05-07T20:31:45.8552832Z         contiguous: bool,
2025-05-07T20:31:45.8553086Z         compiled: bool,
2025-05-07T20:31:45.8553316Z     ) -> None:
2025-05-07T20:31:45.8553548Z         torch.manual_seed(2025)
2025-05-07T20:31:45.8553806Z 
2025-05-07T20:31:45.8554087Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.8554444Z 
2025-05-07T20:31:45.8554647Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.8554953Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.8555274Z         x = x_sign * x_clamp
2025-05-07T20:31:45.8555529Z         x0 = x[:, :D]
2025-05-07T20:31:45.8555881Z         x1 = x[:, D:]
2025-05-07T20:31:45.8556106Z 
2025-05-07T20:31:45.8556304Z         if contiguous:
2025-05-07T20:31:45.8556539Z             x0 = x0.contiguous()
2025-05-07T20:31:45.8556811Z             x1 = x1.contiguous()
2025-05-07T20:31:45.8557063Z 
2025-05-07T20:31:45.8557257Z         if scale_ub is not None:
2025-05-07T20:31:45.8557548Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.8557891Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.8558205Z             )
2025-05-07T20:31:45.8558403Z         else:
2025-05-07T20:31:45.8558622Z             scale_ub_tensor = None
2025-05-07T20:31:45.8558878Z 
2025-05-07T20:31:45.8559118Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.8559444Z             op = silu_mul_quant
2025-05-07T20:31:45.8559700Z             if compiled:
2025-05-07T20:31:45.8559959Z                 op = torch.compile(op)
2025-05-07T20:31:45.8560273Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.8560566Z 
2025-05-07T20:31:45.8560760Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.8560938Z 
2025-05-07T20:31:45.8561043Z moe/activation_test.py:117: 
2025-05-07T20:31:45.8561346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.8561681Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.8561973Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.8562539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:45.8563148Z     return fn(*args, **kwargs)
2025-05-07T20:31:45.8563812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.8564499Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.8565041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.8565819Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.8566486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.8567022Z     kernel = self.compile(
2025-05-07T20:31:45.8567568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.8568222Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.8568627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.8568861Z 
2025-05-07T20:31:45.8569077Z self = 
2025-05-07T20:31:45.8570164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.8571531Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a4dc0>}
2025-05-07T20:31:45.8572870Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.8573893Z context = 
2025-05-07T20:31:45.8574185Z 
2025-05-07T20:31:45.8574368Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.8574892Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.8575367Z                            module_map=module_map)
2025-05-07T20:31:45.8575744Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.8576225Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.8576493Z E       ^
2025-05-07T20:31:45.8576962Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.8577413Z 
2025-05-07T20:31:45.8577838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
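Since the failure is architecture-dependent rather than input-dependent, a device-capability guard would skip these cases up front instead of failing once per generated example. A sketch of such a guard (hypothetical helper name, not from this repo; assumes torch, and infers the SM 8.9+ requirement from the error above, since fp8e4nv corresponds to torch.float8_e4m3fn and is only lowered natively on Ada/Hopper-class GPUs):

import unittest

import torch

def _device_supports_fp8e4nv() -> bool:
    # Compute capability (8, 9) = Ada, (9, 0) = Hopper; older parts such as
    # SM 8.0/8.6 only expose the fp8e4b15/fp8e5 variants listed in the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test above, the whole case would be skipped on this runner:
# @unittest.skipIf(not _device_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...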
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.8577413Z 2025-05-07T20:31:45.8577838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.8578349Z 2025-05-07T20:31:46.1325324Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1325947Z self=, 2025-05-07T20:31:46.1326595Z T=2048, 2025-05-07T20:31:46.1326871Z D=7168, 2025-05-07T20:31:46.1327135Z scale_ub=1200.0, 2025-05-07T20:31:46.1327364Z contiguous=False, 2025-05-07T20:31:46.1327601Z compiled=False, 2025-05-07T20:31:46.1327819Z ) 2025-05-07T20:31:46.1328142Z self = 2025-05-07T20:31:46.1328680Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.1328975Z 2025-05-07T20:31:46.1329063Z @given( 2025-05-07T20:31:46.1329298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1329620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1329939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1330276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1330614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1330911Z ) 2025-05-07T20:31:46.1331276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1331720Z def test_silu_mul_quant( 2025-05-07T20:31:46.1331972Z self, 2025-05-07T20:31:46.1332179Z T: int, 2025-05-07T20:31:46.1332381Z D: int, 2025-05-07T20:31:46.1332611Z scale_ub: Optional[float], 2025-05-07T20:31:46.1332892Z contiguous: bool, 2025-05-07T20:31:46.1333136Z compiled: bool, 2025-05-07T20:31:46.1333770Z ) -> None: 2025-05-07T20:31:46.1333998Z torch.manual_seed(2025) 2025-05-07T20:31:46.1334246Z 2025-05-07T20:31:46.1334529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1334884Z 2025-05-07T20:31:46.1335078Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1335381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1335701Z x = x_sign * x_clamp 2025-05-07T20:31:46.1335954Z x0 = x[:, :D] 2025-05-07T20:31:46.1336179Z x1 = x[:, D:] 2025-05-07T20:31:46.1336403Z 2025-05-07T20:31:46.1336600Z if contiguous: 2025-05-07T20:31:46.1336837Z x0 = x0.contiguous() 2025-05-07T20:31:46.1337110Z x1 = x1.contiguous() 2025-05-07T20:31:46.1337367Z 2025-05-07T20:31:46.1337561Z if scale_ub is not None: 2025-05-07T20:31:46.1337848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1338200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1338528Z ) 2025-05-07T20:31:46.1338736Z else: 2025-05-07T20:31:46.1338957Z scale_ub_tensor = None 2025-05-07T20:31:46.1339212Z 2025-05-07T20:31:46.1339453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1339784Z op = silu_mul_quant 2025-05-07T20:31:46.1340043Z if compiled: 2025-05-07T20:31:46.1340605Z op = torch.compile(op) 2025-05-07T20:31:46.1340914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1341250Z 2025-05-07T20:31:46.1341453Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1341627Z 2025-05-07T20:31:46.1341735Z moe/activation_test.py:117: 2025-05-07T20:31:46.1342039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1342374Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1342665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1343523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1344229Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1344784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1345478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1346148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1346684Z kernel = self.compile( 2025-05-07T20:31:46.1347241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1347907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1348310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1348556Z 2025-05-07T20:31:46.1348774Z self = 2025-05-07T20:31:46.1349860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1351275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd71b0670>} 2025-05-07T20:31:46.1352622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1353637Z context = 2025-05-07T20:31:46.1353937Z 2025-05-07T20:31:46.1354113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1354769Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1355243Z module_map=module_map) 2025-05-07T20:31:46.1355616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1355977Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1356250Z E ^ 2025-05-07T20:31:46.1356714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1357172Z 2025-05-07T20:31:46.1357589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1358112Z 2025-05-07T20:31:46.1358218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1358644Z self=, 2025-05-07T20:31:46.1359047Z T=1, 2025-05-07T20:31:46.1359248Z D=7168, 2025-05-07T20:31:46.1359459Z scale_ub=None, 2025-05-07T20:31:46.1359682Z contiguous=True, 2025-05-07T20:31:46.1359918Z compiled=False, 2025-05-07T20:31:46.1360132Z ) 2025-05-07T20:31:46.1360456Z self = 2025-05-07T20:31:46.1360954Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.1361226Z 2025-05-07T20:31:46.1361309Z @given( 2025-05-07T20:31:46.1361549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1361869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1362188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1362536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1362872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1363172Z ) 2025-05-07T20:31:46.1363531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1364065Z def test_silu_mul_quant( 2025-05-07T20:31:46.1364318Z self, 2025-05-07T20:31:46.1364522Z T: int, 2025-05-07T20:31:46.1364754Z D: int, 2025-05-07T20:31:46.1364977Z scale_ub: Optional[float], 2025-05-07T20:31:46.1365260Z contiguous: bool, 2025-05-07T20:31:46.1365509Z compiled: bool, 2025-05-07T20:31:46.1365731Z ) -> None: 2025-05-07T20:31:46.1365957Z torch.manual_seed(2025) 2025-05-07T20:31:46.1366209Z 2025-05-07T20:31:46.1366487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1366832Z 2025-05-07T20:31:46.1367033Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1367334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1367647Z x = x_sign * x_clamp 2025-05-07T20:31:46.1367902Z x0 = x[:, :D] 2025-05-07T20:31:46.1368128Z x1 = x[:, D:] 2025-05-07T20:31:46.1368338Z 2025-05-07T20:31:46.1368538Z if contiguous: 2025-05-07T20:31:46.1368791Z x0 = x0.contiguous() 2025-05-07T20:31:46.1369054Z x1 = x1.contiguous() 2025-05-07T20:31:46.1369306Z 2025-05-07T20:31:46.1369508Z if scale_ub is not None: 2025-05-07T20:31:46.1369786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1370130Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1370453Z ) 2025-05-07T20:31:46.1370648Z else: 2025-05-07T20:31:46.1370871Z scale_ub_tensor = None 2025-05-07T20:31:46.1371138Z 2025-05-07T20:31:46.1371376Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1371698Z op = silu_mul_quant 2025-05-07T20:31:46.1371963Z if compiled: 2025-05-07T20:31:46.1372223Z op = torch.compile(op) 2025-05-07T20:31:46.1372527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1372818Z 2025-05-07T20:31:46.1373059Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1373330Z 2025-05-07T20:31:46.1373437Z moe/activation_test.py:117: 2025-05-07T20:31:46.1373743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1374080Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1374373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1375075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1375773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1376547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1377389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1378071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1378608Z kernel = self.compile( 2025-05-07T20:31:46.1379175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1379856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1380258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1380501Z 2025-05-07T20:31:46.1380713Z self = 2025-05-07T20:31:46.1381878Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1383250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee3160>} 2025-05-07T20:31:46.1384692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1385731Z context = 2025-05-07T20:31:46.1386029Z 2025-05-07T20:31:46.1386199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1386733Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1387216Z module_map=module_map) 2025-05-07T20:31:46.1387586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1387947Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1388216Z E ^ 2025-05-07T20:31:46.1388680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1389134Z 2025-05-07T20:31:46.1389558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1390092Z 2025-05-07T20:31:46.1390198Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1390620Z self=, 2025-05-07T20:31:46.1391021Z T=16384, 2025-05-07T20:31:46.1391226Z D=7168, 2025-05-07T20:31:46.1391427Z scale_ub=1200.0, 2025-05-07T20:31:46.1391652Z contiguous=False, 2025-05-07T20:31:46.1391888Z compiled=True, 2025-05-07T20:31:46.1392101Z ) 2025-05-07T20:31:46.3299773Z self = 2025-05-07T20:31:46.3300602Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.3301018Z 2025-05-07T20:31:46.3301241Z @given( 2025-05-07T20:31:46.3301575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3301991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3302328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3303088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3303430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3303723Z ) 2025-05-07T20:31:46.3304097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3304552Z def test_silu_mul_quant( 2025-05-07T20:31:46.3304799Z self, 2025-05-07T20:31:46.3305012Z T: int, 2025-05-07T20:31:46.3305218Z D: int, 2025-05-07T20:31:46.3305440Z scale_ub: Optional[float], 2025-05-07T20:31:46.3305726Z contiguous: bool, 2025-05-07T20:31:46.3305975Z compiled: bool, 2025-05-07T20:31:46.3306209Z ) -> None: 2025-05-07T20:31:46.3306439Z torch.manual_seed(2025) 2025-05-07T20:31:46.3306690Z 2025-05-07T20:31:46.3306973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3307323Z 2025-05-07T20:31:46.3307525Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3307838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3308156Z x = x_sign * x_clamp 2025-05-07T20:31:46.3308410Z x0 = x[:, :D] 2025-05-07T20:31:46.3308637Z x1 = x[:, D:] 2025-05-07T20:31:46.3308848Z 2025-05-07T20:31:46.3309047Z if contiguous: 2025-05-07T20:31:46.3309291Z x0 = x0.contiguous() 2025-05-07T20:31:46.3309560Z x1 = x1.contiguous() 2025-05-07T20:31:46.3309832Z 2025-05-07T20:31:46.3310036Z if scale_ub is not None: 2025-05-07T20:31:46.3310315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3310668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3319117Z ) 2025-05-07T20:31:46.3319345Z else: 2025-05-07T20:31:46.3319572Z scale_ub_tensor = None 2025-05-07T20:31:46.3319840Z 2025-05-07T20:31:46.3320081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3320614Z op = silu_mul_quant 2025-05-07T20:31:46.3320894Z if compiled: 2025-05-07T20:31:46.3321151Z op = torch.compile(op) 2025-05-07T20:31:46.3321458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3321743Z 2025-05-07T20:31:46.3321940Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3322122Z 2025-05-07T20:31:46.3322228Z moe/activation_test.py:117: 2025-05-07T20:31:46.3322540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3322910Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3323230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3323800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.3324369Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.3325030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3325740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3326296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3326992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3327661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3328198Z kernel = self.compile( 2025-05-07T20:31:46.3328747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3329409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3329816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3330056Z 2025-05-07T20:31:46.3330264Z self = 2025-05-07T20:31:46.3331352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3332837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee34c0>} 2025-05-07T20:31:46.3334180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3335200Z context = 2025-05-07T20:31:46.3335496Z 2025-05-07T20:31:46.3335667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3336196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3336683Z module_map=module_map) 2025-05-07T20:31:46.3337062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3337421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3337682Z E ^ 2025-05-07T20:31:46.3338154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3338611Z 2025-05-07T20:31:46.3339034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3339545Z 2025-05-07T20:31:46.3339658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3340371Z self=, 2025-05-07T20:31:46.3340786Z T=1, 2025-05-07T20:31:46.3341002Z D=7168, 2025-05-07T20:31:46.3341249Z scale_ub=None, 2025-05-07T20:31:46.3341471Z contiguous=False, 2025-05-07T20:31:46.3341712Z compiled=False, 2025-05-07T20:31:46.3342671Z ) 2025-05-07T20:31:46.3342998Z self = 2025-05-07T20:31:46.3343502Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.3343772Z 2025-05-07T20:31:46.3343853Z @given( 2025-05-07T20:31:46.3344094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3344415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3344731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3345071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3345404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3345701Z ) 2025-05-07T20:31:46.3346059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3346506Z def test_silu_mul_quant( 2025-05-07T20:31:46.3346764Z self, 2025-05-07T20:31:46.3346972Z T: int, 2025-05-07T20:31:46.3347186Z D: int, 2025-05-07T20:31:46.3347419Z scale_ub: Optional[float], 2025-05-07T20:31:46.3347710Z contiguous: bool, 2025-05-07T20:31:46.3347960Z compiled: bool, 2025-05-07T20:31:46.3348195Z ) -> None: 2025-05-07T20:31:46.3348425Z torch.manual_seed(2025) 2025-05-07T20:31:46.3348682Z 2025-05-07T20:31:46.3348957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3349313Z 2025-05-07T20:31:46.3349513Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3349811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3350135Z x = x_sign * x_clamp 2025-05-07T20:31:46.3350387Z x0 = x[:, :D] 2025-05-07T20:31:46.3350613Z x1 = x[:, D:] 2025-05-07T20:31:46.3350833Z 2025-05-07T20:31:46.3351033Z if contiguous: 2025-05-07T20:31:46.3351265Z x0 = x0.contiguous() 2025-05-07T20:31:46.3351534Z x1 = x1.contiguous() 2025-05-07T20:31:46.3351911Z 2025-05-07T20:31:46.3352109Z if scale_ub is not None: 2025-05-07T20:31:46.3352398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3352746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3353101Z ) 2025-05-07T20:31:46.3353315Z else: 2025-05-07T20:31:46.3353536Z scale_ub_tensor = None 2025-05-07T20:31:46.3353795Z 2025-05-07T20:31:46.3354029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3354358Z op = silu_mul_quant 2025-05-07T20:31:46.3354620Z if compiled: 2025-05-07T20:31:46.3354869Z op = torch.compile(op) 2025-05-07T20:31:46.3355175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3355461Z 2025-05-07T20:31:46.3355654Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3355830Z 2025-05-07T20:31:46.3355934Z moe/activation_test.py:117: 2025-05-07T20:31:46.3356241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3356588Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3356886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3357587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3358293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3358838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3359532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3360211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3360756Z kernel = self.compile( 2025-05-07T20:31:46.3361301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3362057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3362472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3362708Z 2025-05-07T20:31:46.3362920Z self = 2025-05-07T20:31:46.3364006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3365381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7023820>} 2025-05-07T20:31:46.3366720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3367770Z context = 2025-05-07T20:31:46.3368060Z 2025-05-07T20:31:46.3368227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3368766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3369237Z module_map=module_map) 2025-05-07T20:31:46.3369608Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3369967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3370233Z E ^ 2025-05-07T20:31:46.3370697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3371149Z 2025-05-07T20:31:46.3371571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3372093Z 2025-05-07T20:31:46.3372204Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3372761Z self=, 2025-05-07T20:31:46.3373164Z T=2048, 2025-05-07T20:31:46.3373359Z D=7168, 2025-05-07T20:31:46.3373556Z scale_ub=None, 2025-05-07T20:31:46.3373773Z contiguous=False, 2025-05-07T20:31:46.3374008Z compiled=True, 2025-05-07T20:31:46.3374218Z ) 2025-05-07T20:31:46.4543570Z self = 2025-05-07T20:31:46.4544355Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.4544757Z 2025-05-07T20:31:46.4544870Z @given( 2025-05-07T20:31:46.4545201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.4545643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.4545958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.4546307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.4546667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.4546979Z ) 2025-05-07T20:31:46.4547339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.4547789Z def test_silu_mul_quant( 2025-05-07T20:31:46.4548035Z self, 2025-05-07T20:31:46.4548245Z T: int, 2025-05-07T20:31:46.4548457Z D: int, 2025-05-07T20:31:46.4548681Z scale_ub: Optional[float], 2025-05-07T20:31:46.4548970Z contiguous: bool, 2025-05-07T20:31:46.4549222Z compiled: bool, 2025-05-07T20:31:46.4549456Z ) -> None: 2025-05-07T20:31:46.4549687Z torch.manual_seed(2025) 2025-05-07T20:31:46.4549943Z 2025-05-07T20:31:46.4550221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.4550578Z 2025-05-07T20:31:46.4550783Z x_sign = torch.sign(x) 2025-05-07T20:31:46.4551089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.4551403Z x = x_sign * x_clamp 2025-05-07T20:31:46.4552055Z x0 = x[:, :D] 2025-05-07T20:31:46.4552288Z x1 = x[:, D:] 2025-05-07T20:31:46.4552501Z 2025-05-07T20:31:46.4552698Z if contiguous: 2025-05-07T20:31:46.4552969Z x0 = x0.contiguous() 2025-05-07T20:31:46.4553264Z x1 = x1.contiguous() 2025-05-07T20:31:46.4553516Z 2025-05-07T20:31:46.4553721Z if scale_ub is not None: 2025-05-07T20:31:46.4554003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.4554348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.4554671Z ) 2025-05-07T20:31:46.4554876Z else: 2025-05-07T20:31:46.4555101Z scale_ub_tensor = None 2025-05-07T20:31:46.4555361Z 2025-05-07T20:31:46.4555595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.4555923Z op = silu_mul_quant 2025-05-07T20:31:46.4556187Z if compiled: 2025-05-07T20:31:46.4556447Z op = torch.compile(op) 2025-05-07T20:31:46.4556766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4557061Z 2025-05-07T20:31:46.4557267Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.4557435Z 2025-05-07T20:31:46.4557540Z moe/activation_test.py:117: 2025-05-07T20:31:46.4557851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4558196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.4558482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4559049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.4559614Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.4560287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.4560982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.4561535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.4562408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.4563081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.4563633Z kernel = self.compile( 2025-05-07T20:31:46.4564183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.4564848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.4565256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4565500Z 2025-05-07T20:31:46.4565715Z self = 2025-05-07T20:31:46.4566802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.4568192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6fe1790>} 2025-05-07T20:31:46.4569533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.4570555Z context = 2025-05-07T20:31:46.4570856Z 2025-05-07T20:31:46.4571029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.4571564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.4572035Z module_map=module_map) 2025-05-07T20:31:46.4572499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.4572879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.4573155Z E ^ 2025-05-07T20:31:46.4573629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.4574089Z 2025-05-07T20:31:46.4574507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.4575022Z 2025-05-07T20:31:46.4575137Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.4575562Z self=, 2025-05-07T20:31:46.4575967Z T=4096, 2025-05-07T20:31:46.4576168Z D=7168, 2025-05-07T20:31:46.4576368Z scale_ub=None, 2025-05-07T20:31:46.4576589Z contiguous=False, 2025-05-07T20:31:46.4576825Z compiled=True, 2025-05-07T20:31:46.4577040Z ) 2025-05-07T20:31:46.4577363Z self = 2025-05-07T20:31:46.4577880Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.4578155Z 2025-05-07T20:31:46.4578243Z @given( 2025-05-07T20:31:46.4578478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.4578807Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.4579124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.4579465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.4579801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.4580097Z ) 2025-05-07T20:31:46.4580455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.4580907Z def test_silu_mul_quant( 2025-05-07T20:31:46.4581278Z self, 2025-05-07T20:31:46.4581485Z T: int, 2025-05-07T20:31:46.4581685Z D: int, 2025-05-07T20:31:46.4581914Z scale_ub: Optional[float], 2025-05-07T20:31:46.4582198Z contiguous: bool, 2025-05-07T20:31:46.4582535Z compiled: bool, 2025-05-07T20:31:46.4582773Z ) -> None: 2025-05-07T20:31:46.4582999Z torch.manual_seed(2025) 2025-05-07T20:31:46.4583268Z 2025-05-07T20:31:46.4583578Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.4583948Z 2025-05-07T20:31:46.4584150Z x_sign = torch.sign(x) 2025-05-07T20:31:46.4584445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.4584768Z x = x_sign * x_clamp 2025-05-07T20:31:46.4585022Z x0 = x[:, :D] 2025-05-07T20:31:46.4585245Z x1 = x[:, D:] 2025-05-07T20:31:46.4585465Z 2025-05-07T20:31:46.4585663Z if contiguous: 2025-05-07T20:31:46.4585900Z x0 = x0.contiguous() 2025-05-07T20:31:46.4586170Z x1 = x1.contiguous() 2025-05-07T20:31:46.4586423Z 2025-05-07T20:31:46.4586619Z if scale_ub is not None: 2025-05-07T20:31:46.4586906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.4587267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.4587594Z ) 2025-05-07T20:31:46.4587793Z else: 2025-05-07T20:31:46.4588021Z scale_ub_tensor = None 2025-05-07T20:31:46.4588286Z 2025-05-07T20:31:46.4588520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.4588846Z op = silu_mul_quant 2025-05-07T20:31:46.4589112Z if compiled: 2025-05-07T20:31:46.4589365Z op = torch.compile(op) 2025-05-07T20:31:46.4589675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4589961Z 2025-05-07T20:31:46.4590159Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.4590334Z 2025-05-07T20:31:46.4590438Z moe/activation_test.py:117: 2025-05-07T20:31:46.4590743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4591080Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.4591382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4592043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.4592621Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.4593341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.4594046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.4594594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.4595277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.4595951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.4596498Z kernel = self.compile( 2025-05-07T20:31:46.4597048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.4597719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.4598125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4598361Z 2025-05-07T20:31:46.4598578Z self = 2025-05-07T20:31:46.4599664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.4601027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a4c0>} 2025-05-07T20:31:46.4602372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.4603491Z context = 2025-05-07T20:31:46.4603784Z 2025-05-07T20:31:46.4603966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.4604492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.4604976Z module_map=module_map) 2025-05-07T20:31:46.4605353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.4605717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.4605985Z E ^ 2025-05-07T20:31:46.4606457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.4606908Z 2025-05-07T20:31:46.4607334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.4607854Z 2025-05-07T20:31:46.6668764Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6669366Z self=, 2025-05-07T20:31:46.6669959Z T=16384, 2025-05-07T20:31:46.6670273Z D=5120, 2025-05-07T20:31:46.6670523Z scale_ub=1200.0, 2025-05-07T20:31:46.6670757Z contiguous=False, 2025-05-07T20:31:46.6670992Z compiled=False, 2025-05-07T20:31:46.6671213Z ) 2025-05-07T20:31:46.6671541Z self = 2025-05-07T20:31:46.6672048Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.6672339Z 2025-05-07T20:31:46.6672430Z @given( 2025-05-07T20:31:46.6672667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6672996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6673313Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6673662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6674382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6674685Z ) 2025-05-07T20:31:46.6675049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6675501Z def test_silu_mul_quant( 2025-05-07T20:31:46.6675754Z self, 2025-05-07T20:31:46.6675962Z T: int, 2025-05-07T20:31:46.6676165Z D: int, 2025-05-07T20:31:46.6676396Z scale_ub: Optional[float], 2025-05-07T20:31:46.6676686Z contiguous: bool, 2025-05-07T20:31:46.6676931Z compiled: bool, 2025-05-07T20:31:46.6677168Z ) -> None: 2025-05-07T20:31:46.6677397Z torch.manual_seed(2025) 2025-05-07T20:31:46.6677644Z 2025-05-07T20:31:46.6677929Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6678281Z 2025-05-07T20:31:46.6678479Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6678780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6679110Z x = x_sign * x_clamp 2025-05-07T20:31:46.6679366Z x0 = x[:, :D] 2025-05-07T20:31:46.6679586Z x1 = x[:, D:] 2025-05-07T20:31:46.6679804Z 2025-05-07T20:31:46.6680002Z if contiguous: 2025-05-07T20:31:46.6680237Z x0 = x0.contiguous() 2025-05-07T20:31:46.6680508Z x1 = x1.contiguous() 2025-05-07T20:31:46.6680760Z 2025-05-07T20:31:46.6680955Z if scale_ub is not None: 2025-05-07T20:31:46.6681241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6681588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6681902Z ) 2025-05-07T20:31:46.6682108Z else: 2025-05-07T20:31:46.6682332Z scale_ub_tensor = None 2025-05-07T20:31:46.6682586Z 2025-05-07T20:31:46.6682831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6683200Z op = silu_mul_quant 2025-05-07T20:31:46.6683463Z if compiled: 2025-05-07T20:31:46.6683887Z op = torch.compile(op) 2025-05-07T20:31:46.6684201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6684491Z 2025-05-07T20:31:46.6684687Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6684864Z 2025-05-07T20:31:46.6684967Z moe/activation_test.py:117: 2025-05-07T20:31:46.6685278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6685614Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6685908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6686613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.6687307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6687860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6688558Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6689241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6689780Z kernel = self.compile( 2025-05-07T20:31:46.6690344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6691016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6691426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6691662Z 2025-05-07T20:31:46.6691876Z self = 2025-05-07T20:31:46.6692956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6694479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a820>} 2025-05-07T20:31:46.6695845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6696866Z context = 2025-05-07T20:31:46.6697167Z 2025-05-07T20:31:46.6697339Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6697874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6698349Z module_map=module_map) 2025-05-07T20:31:46.6698721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6699088Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6699355Z E ^ 2025-05-07T20:31:46.6699829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6700294Z 2025-05-07T20:31:46.6700712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6701326Z 2025-05-07T20:31:46.6701434Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6701855Z self=, 2025-05-07T20:31:46.6702258Z T=16384, 2025-05-07T20:31:46.6702461Z D=5120, 2025-05-07T20:31:46.6702663Z scale_ub=1200.0, 2025-05-07T20:31:46.6702891Z contiguous=True, 2025-05-07T20:31:46.6703143Z compiled=True, 2025-05-07T20:31:46.6703358Z ) 2025-05-07T20:31:46.6703716Z self = 2025-05-07T20:31:46.6704242Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.6704609Z 2025-05-07T20:31:46.6704705Z @given( 2025-05-07T20:31:46.6713711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6714057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6714382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6714717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6715058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6715363Z ) 2025-05-07T20:31:46.6715722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6716179Z def test_silu_mul_quant( 2025-05-07T20:31:46.6716435Z self, 2025-05-07T20:31:46.6716636Z T: int, 2025-05-07T20:31:46.6716850Z D: int, 2025-05-07T20:31:46.6717082Z scale_ub: Optional[float], 2025-05-07T20:31:46.6717360Z contiguous: bool, 2025-05-07T20:31:46.6717615Z compiled: bool, 2025-05-07T20:31:46.6717858Z ) -> None: 2025-05-07T20:31:46.6718101Z torch.manual_seed(2025) 2025-05-07T20:31:46.6718361Z 2025-05-07T20:31:46.6718648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6719007Z 2025-05-07T20:31:46.6719205Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6719514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6719839Z x = x_sign * x_clamp 2025-05-07T20:31:46.6720088Z x0 = x[:, :D] 2025-05-07T20:31:46.6720318Z x1 = x[:, D:] 2025-05-07T20:31:46.6720539Z 2025-05-07T20:31:46.6720727Z if contiguous: 2025-05-07T20:31:46.6720975Z x0 = x0.contiguous() 2025-05-07T20:31:46.6721248Z x1 = x1.contiguous() 2025-05-07T20:31:46.6721495Z 2025-05-07T20:31:46.6721698Z if scale_ub is not None: 2025-05-07T20:31:46.6721984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6722327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6722651Z ) 2025-05-07T20:31:46.6722980Z else: 2025-05-07T20:31:46.6723199Z scale_ub_tensor = None 2025-05-07T20:31:46.6723463Z 2025-05-07T20:31:46.6723705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6724032Z op = silu_mul_quant 2025-05-07T20:31:46.6724288Z if compiled: 2025-05-07T20:31:46.6724545Z op = torch.compile(op) 2025-05-07T20:31:46.6724858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6725136Z 2025-05-07T20:31:46.6725338Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6725508Z 2025-05-07T20:31:46.6725621Z moe/activation_test.py:117: 2025-05-07T20:31:46.6725920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6726264Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6726557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6727120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.6727703Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.6728383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.6729080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6729628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6730320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6730995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6731538Z kernel = self.compile( 2025-05-07T20:31:46.6732085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6732758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6733277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6733516Z 2025-05-07T20:31:46.6733730Z self = 2025-05-07T20:31:46.6734819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6736203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6df6e50>} 2025-05-07T20:31:46.6737558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6738582Z context = 2025-05-07T20:31:46.6738887Z 2025-05-07T20:31:46.6739062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6739609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6740408Z module_map=module_map) 2025-05-07T20:31:46.6740801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6741219Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6741495Z E ^ 2025-05-07T20:31:46.6741970Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6742419Z 2025-05-07T20:31:46.6742842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6743364Z 2025-05-07T20:31:47.1105424Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.1106526Z self=, 2025-05-07T20:31:47.1107137Z T=16384, 2025-05-07T20:31:47.1107427Z D=5120, 2025-05-07T20:31:47.1107699Z scale_ub=None, 2025-05-07T20:31:47.1107988Z contiguous=False, 2025-05-07T20:31:47.1108295Z compiled=True, 2025-05-07T20:31:47.1108579Z ) 2025-05-07T20:31:47.1108953Z self = 2025-05-07T20:31:47.1109470Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:47.1109765Z 2025-05-07T20:31:47.1109850Z @given( 2025-05-07T20:31:47.1110103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.1110425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.1110748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.1111095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.1111433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.1111752Z ) 2025-05-07T20:31:47.1112131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.1112589Z def test_silu_mul_quant( 2025-05-07T20:31:47.1112845Z self, 2025-05-07T20:31:47.1113058Z T: int, 2025-05-07T20:31:47.1113275Z D: int, 2025-05-07T20:31:47.1113506Z scale_ub: Optional[float], 2025-05-07T20:31:47.1113797Z contiguous: bool, 2025-05-07T20:31:47.1114060Z compiled: bool, 2025-05-07T20:31:47.1114296Z ) -> None: 2025-05-07T20:31:47.1114532Z torch.manual_seed(2025) 2025-05-07T20:31:47.1114793Z 2025-05-07T20:31:47.1115083Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.1115442Z 2025-05-07T20:31:47.1115648Z x_sign = torch.sign(x) 2025-05-07T20:31:47.1115956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.1116278Z x = x_sign * x_clamp 2025-05-07T20:31:47.1116534Z x0 = x[:, :D] 2025-05-07T20:31:47.1116764Z x1 = x[:, D:] 2025-05-07T20:31:47.1117171Z 2025-05-07T20:31:47.1117375Z if contiguous: 2025-05-07T20:31:47.1117621Z x0 = x0.contiguous() 2025-05-07T20:31:47.1117892Z x1 = x1.contiguous() 2025-05-07T20:31:47.1118150Z 2025-05-07T20:31:47.1118359Z if scale_ub is not None: 2025-05-07T20:31:47.1118648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.1118993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.1119317Z ) 2025-05-07T20:31:47.1119527Z else: 2025-05-07T20:31:47.1119748Z scale_ub_tensor = None 2025-05-07T20:31:47.1120013Z 2025-05-07T20:31:47.1120261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.1120581Z op = silu_mul_quant 2025-05-07T20:31:47.1120848Z if compiled: 2025-05-07T20:31:47.1121114Z op = torch.compile(op) 2025-05-07T20:31:47.1121418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.1121717Z 2025-05-07T20:31:47.1121921Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.1122091Z 2025-05-07T20:31:47.1122199Z moe/activation_test.py:117: 2025-05-07T20:31:47.1122511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.1122854Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.1123146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.1123710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:47.1124284Z return fn(*args, **kwargs) 
2025-05-07T20:31:47.1124953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.1125654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.1126208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.1127045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.1127734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.1128272Z kernel = self.compile( 2025-05-07T20:31:47.1128823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.1129491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.1129892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.1130134Z 2025-05-07T20:31:47.1130346Z self = 2025-05-07T20:31:47.1131433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.1132826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6d799d0>} 2025-05-07T20:31:47.1134168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.1135188Z context = 2025-05-07T20:31:47.1135485Z 2025-05-07T20:31:47.1135656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.1136188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.1136659Z module_map=module_map) 2025-05-07T20:31:47.1137031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.1137396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.1137756Z E ^ 2025-05-07T20:31:47.1138221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.1138677Z 2025-05-07T20:31:47.1139097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:47.1139620Z 2025-05-07T20:31:47.1139727Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.1140456Z self=, 2025-05-07T20:31:47.1140862Z T=2048, 2025-05-07T20:31:47.1141058Z D=5120, 2025-05-07T20:31:47.1141322Z scale_ub=None, 2025-05-07T20:31:47.1141542Z contiguous=False, 2025-05-07T20:31:47.1141780Z compiled=True, 2025-05-07T20:31:47.1141993Z ) 2025-05-07T20:31:47.2350881Z self = 2025-05-07T20:31:47.2351678Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:47.2352099Z 2025-05-07T20:31:47.2352191Z @given( 2025-05-07T20:31:47.2352522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.2352974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.2353402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.2353748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.2354093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.2354383Z ) 2025-05-07T20:31:47.2354742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.2355194Z def test_silu_mul_quant( 2025-05-07T20:31:47.2355439Z self, 2025-05-07T20:31:47.2355641Z T: int, 2025-05-07T20:31:47.2355845Z D: int, 2025-05-07T20:31:47.2356071Z scale_ub: Optional[float], 2025-05-07T20:31:47.2356345Z contiguous: bool, 2025-05-07T20:31:47.2356591Z compiled: bool, 2025-05-07T20:31:47.2356831Z ) -> None: 2025-05-07T20:31:47.2357388Z torch.manual_seed(2025) 2025-05-07T20:31:47.2357645Z 2025-05-07T20:31:47.2357925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.2358271Z 2025-05-07T20:31:47.2358472Z x_sign = torch.sign(x) 2025-05-07T20:31:47.2358773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.2359091Z x = x_sign * x_clamp 2025-05-07T20:31:47.2359345Z x0 = x[:, :D] 2025-05-07T20:31:47.2359572Z x1 = x[:, D:] 2025-05-07T20:31:47.2359782Z 2025-05-07T20:31:47.2359974Z if contiguous: 2025-05-07T20:31:47.2360214Z x0 = x0.contiguous() 2025-05-07T20:31:47.2360477Z x1 = x1.contiguous() 2025-05-07T20:31:47.2360730Z 2025-05-07T20:31:47.2360930Z if scale_ub is not None: 2025-05-07T20:31:47.2361209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.2361555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.2361885Z ) 2025-05-07T20:31:47.2362085Z else: 2025-05-07T20:31:47.2362297Z scale_ub_tensor = None 2025-05-07T20:31:47.2362559Z 2025-05-07T20:31:47.2362800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.2363119Z op = silu_mul_quant 2025-05-07T20:31:47.2363383Z if compiled: 2025-05-07T20:31:47.2363641Z op = torch.compile(op) 2025-05-07T20:31:47.2363943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.2364232Z 2025-05-07T20:31:47.2364433Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.2364603Z 2025-05-07T20:31:47.2364710Z moe/activation_test.py:117: 2025-05-07T20:31:47.2365016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.2365359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.2365655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.2366227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:47.2366980Z return fn(*args, **kwargs) 
2025-05-07T20:31:47.2367651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.2368340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.2368894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.2369587Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.2370258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.2370789Z kernel = self.compile( 2025-05-07T20:31:47.2371343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.2372011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.2372417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.2372656Z 2025-05-07T20:31:47.2372868Z self = 2025-05-07T20:31:47.2373998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.2375382Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6caa550>} 2025-05-07T20:31:47.2376722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.2377821Z context = 2025-05-07T20:31:47.2378127Z 2025-05-07T20:31:47.2378297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.2378826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.2379302Z module_map=module_map) 2025-05-07T20:31:47.2379670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.2380031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.2380300Z E ^ 2025-05-07T20:31:47.2380764Z E ValueError("type fp8e4nv not supported in this architecture. 
Nine more examples failed identically, with the same test body and traceback as above (the compiled=False runs only lack the torch/_dynamo/eval_frame.py frame), each ending in:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
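The failure does not depend on the FBGEMM kernel itself. Assuming a Triton build that exposes tl.float8e4nv (the dtype named in the error) and a PyTorch with torch.float8_e4m3fn, a one-line cast reproduces the same compile-time check; a sketch for isolating the problem from the MoE code, not a confirmed repro from this run:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8e4nv_probe(x_ptr, y_ptr):
        # The cast to tl.float8e4nv is what the NVIDIA backend rejects on pre-sm_89 GPUs.
        tl.store(y_ptr, tl.load(x_ptr).to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8e4nv_probe[(1,)](x, y)  # expected: CompilationError on sm_86, compiles on sm_89+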
The run then shifted from compilation failures to CUDA out-of-memory errors:

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
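The requested sizes match the shape arithmetic exactly: x is a [T, 2*D] bfloat16 tensor, and each elementwise step (torch.abs, torch.clamp, torch.sign, the final multiply) materializes another tensor of the same T * 2D * 2 bytes, so the failed allocations are ordinary intermediates on a card already nearly full after the preceding examples. A quick illustrative check, not part of the test suite:

    def bf16_mib(T: int, D: int) -> float:
        # One [T, 2*D] bfloat16 tensor at 2 bytes per element, in MiB.
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(16384, 5120) == 320.0  # the 320.00 MiB allocation above
    assert bf16_mib(4096, 7168) == 112.0   # the 112.00 MiB allocation below
    assert bf16_mib(16384, 7168) == 448.0  # the 448.00 MiB allocation below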
Three further examples failed the same way:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free; 22.03 GiB already in use.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 144.44 MiB free; 21.92 GiB already in use.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free; 22.03 GiB already in use.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5334060Z 2025-05-07T20:31:48.5334185Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.5334412Z 2025-05-07T20:31:48.5334522Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5335070Z self=, 2025-05-07T20:31:48.5335484Z T=2048, 2025-05-07T20:31:48.5335676Z D=7168, 2025-05-07T20:31:48.5335879Z scale_ub=None, 2025-05-07T20:31:48.5336110Z contiguous=True, 2025-05-07T20:31:48.5336339Z compiled=False, 2025-05-07T20:31:48.5336560Z ) 2025-05-07T20:31:48.5336886Z self = 2025-05-07T20:31:48.5337389Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.5337671Z 2025-05-07T20:31:48.5337753Z @given( 2025-05-07T20:31:48.5338000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5338318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5338639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5338981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5339337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5339634Z ) 2025-05-07T20:31:48.5339998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5340755Z def test_silu_mul_quant( 2025-05-07T20:31:48.5340999Z self, 2025-05-07T20:31:48.5341280Z T: int, 2025-05-07T20:31:48.5341495Z D: int, 2025-05-07T20:31:48.5341718Z scale_ub: Optional[float], 2025-05-07T20:31:48.5341996Z contiguous: bool, 2025-05-07T20:31:48.5342248Z compiled: bool, 2025-05-07T20:31:48.5342472Z ) -> None: 2025-05-07T20:31:48.5342694Z torch.manual_seed(2025) 2025-05-07T20:31:48.5342947Z 2025-05-07T20:31:48.5343221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5343573Z 2025-05-07T20:31:48.5343775Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.5345841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
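Note that this failure happens on a 56 MiB intermediate (torch.sign) while PyTorch already holds ~21.7 GiB, which suggests tensors from earlier hypothesis examples are still alive. One plausible mitigation sketch (an assumption, not something the test file above is shown to do) is to release the allocator cache between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # gc first so dead tensors become collectable, then drop cached
        # allocator blocks so the next example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()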
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5347714Z 2025-05-07T20:31:48.5347845Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.5348068Z 2025-05-07T20:31:48.5348173Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5348595Z self=, 2025-05-07T20:31:48.5349004Z T=1, 2025-05-07T20:31:48.5349192Z D=7168, 2025-05-07T20:31:48.5349394Z scale_ub=1200.0, 2025-05-07T20:31:48.5349632Z contiguous=True, 2025-05-07T20:31:48.5349857Z compiled=False, 2025-05-07T20:31:48.5350082Z ) 2025-05-07T20:31:48.6911980Z self = 2025-05-07T20:31:48.6912699Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6913079Z 2025-05-07T20:31:48.6913186Z @given( 2025-05-07T20:31:48.6913501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6913920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6914253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6914601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6914936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6915233Z ) 2025-05-07T20:31:48.6915598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6916043Z def test_silu_mul_quant( 2025-05-07T20:31:48.6916295Z self, 2025-05-07T20:31:48.6916505Z T: int, 2025-05-07T20:31:48.6916714Z D: int, 2025-05-07T20:31:48.6917301Z scale_ub: Optional[float], 2025-05-07T20:31:48.6917591Z contiguous: bool, 2025-05-07T20:31:48.6917843Z compiled: bool, 2025-05-07T20:31:48.6918077Z ) -> None: 2025-05-07T20:31:48.6918306Z torch.manual_seed(2025) 2025-05-07T20:31:48.6918561Z 2025-05-07T20:31:48.6918837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6919191Z 2025-05-07T20:31:48.6919397Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6919693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6920018Z x = x_sign * x_clamp 2025-05-07T20:31:48.6920270Z x0 = x[:, :D] 2025-05-07T20:31:48.6920490Z x1 = x[:, D:] 2025-05-07T20:31:48.6920708Z 2025-05-07T20:31:48.6920906Z if contiguous: 2025-05-07T20:31:48.6921141Z x0 = x0.contiguous() 2025-05-07T20:31:48.6921409Z x1 = x1.contiguous() 2025-05-07T20:31:48.6921661Z 2025-05-07T20:31:48.6921870Z if scale_ub is not None: 2025-05-07T20:31:48.6922152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6922496Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6922814Z ) 2025-05-07T20:31:48.6923011Z else: 2025-05-07T20:31:48.6923230Z scale_ub_tensor = None 2025-05-07T20:31:48.6923494Z 2025-05-07T20:31:48.6923729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6924056Z op = silu_mul_quant 2025-05-07T20:31:48.6924321Z if compiled: 2025-05-07T20:31:48.6924575Z op = torch.compile(op) 2025-05-07T20:31:48.6924888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6925177Z 2025-05-07T20:31:48.6925373Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6925552Z 2025-05-07T20:31:48.6925658Z moe/activation_test.py:117: 2025-05-07T20:31:48.6925970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6926454Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6926755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6927460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6928162Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6928711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6929405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6930082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6930626Z kernel = self.compile( 2025-05-07T20:31:48.6931182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6931853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6932257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6932497Z 2025-05-07T20:31:48.6932710Z self = 2025-05-07T20:31:48.6933801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6935176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6798550>} 2025-05-07T20:31:48.6936517Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6937534Z context = 2025-05-07T20:31:48.6937918Z 2025-05-07T20:31:48.6938092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6938628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6939103Z module_map=module_map) 2025-05-07T20:31:48.6939473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6939836Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6940390Z E ^ 2025-05-07T20:31:48.6940854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6941361Z 2025-05-07T20:31:48.6941781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6942301Z 2025-05-07T20:31:48.6942406Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6942841Z self=, 2025-05-07T20:31:48.6943243Z T=128, 2025-05-07T20:31:48.6943441Z D=5120, 2025-05-07T20:31:48.6943644Z scale_ub=None, 2025-05-07T20:31:48.6943863Z contiguous=True, 2025-05-07T20:31:48.6944100Z compiled=False, 2025-05-07T20:31:48.6944345Z ) 2025-05-07T20:31:48.6944691Z self = 2025-05-07T20:31:48.6945191Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6945458Z 2025-05-07T20:31:48.6945544Z @given( 2025-05-07T20:31:48.6945774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6946099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6946414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6946751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6947080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6947500Z ) 2025-05-07T20:31:48.6947864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6948309Z def test_silu_mul_quant( 2025-05-07T20:31:48.6948559Z self, 2025-05-07T20:31:48.6948762Z T: int, 2025-05-07T20:31:48.6948962Z D: int, 2025-05-07T20:31:48.6949194Z scale_ub: Optional[float], 2025-05-07T20:31:48.6949484Z contiguous: bool, 2025-05-07T20:31:48.6949725Z compiled: bool, 2025-05-07T20:31:48.6949959Z ) -> None: 2025-05-07T20:31:48.6950182Z torch.manual_seed(2025) 2025-05-07T20:31:48.6950424Z 2025-05-07T20:31:48.6950700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6951048Z 2025-05-07T20:31:48.6951242Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6951543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6951859Z x = x_sign * x_clamp 2025-05-07T20:31:48.6952111Z x0 = x[:, :D] 2025-05-07T20:31:48.6952341Z x1 = x[:, D:] 2025-05-07T20:31:48.6952557Z 2025-05-07T20:31:48.6952749Z if contiguous: 2025-05-07T20:31:48.6952981Z x0 = x0.contiguous() 2025-05-07T20:31:48.6953254Z x1 = x1.contiguous() 2025-05-07T20:31:48.6953506Z 2025-05-07T20:31:48.6953701Z if scale_ub is not None: 2025-05-07T20:31:48.6953982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6954333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6954645Z ) 2025-05-07T20:31:48.6954858Z else: 2025-05-07T20:31:48.6955082Z scale_ub_tensor = None 2025-05-07T20:31:48.6955342Z 2025-05-07T20:31:48.6955591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6955916Z op = silu_mul_quant 2025-05-07T20:31:48.6956170Z if compiled: 2025-05-07T20:31:48.6956428Z op = torch.compile(op) 2025-05-07T20:31:48.6956746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6957182Z 2025-05-07T20:31:48.6957375Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6957556Z 2025-05-07T20:31:48.6957657Z moe/activation_test.py:117: 2025-05-07T20:31:48.6957963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6958297Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6958590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6959285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6959978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6960518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6961209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6961882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6962419Z kernel = self.compile( 2025-05-07T20:31:48.6962963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6963647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6964074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6964307Z 2025-05-07T20:31:48.6964518Z self = 2025-05-07T20:31:48.6965595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6967068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd696b040>} 2025-05-07T20:31:48.6968426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6969440Z context = 2025-05-07T20:31:48.6969741Z 2025-05-07T20:31:48.6969913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6970448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6970923Z module_map=module_map) 2025-05-07T20:31:48.6971289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6971655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6971929Z E ^ 2025-05-07T20:31:48.6972399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6972872Z 2025-05-07T20:31:48.6973294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6973810Z 2025-05-07T20:31:48.6973920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6974344Z self=, 2025-05-07T20:31:48.6974751Z T=128, 2025-05-07T20:31:48.6974947Z D=7168, 2025-05-07T20:31:48.6975148Z scale_ub=None, 2025-05-07T20:31:48.6975364Z contiguous=True, 2025-05-07T20:31:48.6975597Z compiled=False, 2025-05-07T20:31:48.6975815Z ) 2025-05-07T20:31:48.7875553Z self = 2025-05-07T20:31:48.7876099Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.7876371Z 2025-05-07T20:31:48.7876459Z @given( 2025-05-07T20:31:48.7876712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.7877279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.7877604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.7877946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.7878281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.7878579Z ) 2025-05-07T20:31:48.7878937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.7879382Z def test_silu_mul_quant( 2025-05-07T20:31:48.7879633Z self, 2025-05-07T20:31:48.7879840Z T: int, 2025-05-07T20:31:48.7880039Z D: int, 2025-05-07T20:31:48.7880268Z scale_ub: Optional[float], 2025-05-07T20:31:48.7880551Z contiguous: bool, 2025-05-07T20:31:48.7880795Z compiled: bool, 2025-05-07T20:31:48.7881034Z ) -> None: 2025-05-07T20:31:48.7881260Z torch.manual_seed(2025) 2025-05-07T20:31:48.7881506Z 2025-05-07T20:31:48.7881988Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.7882353Z 2025-05-07T20:31:48.7882556Z x_sign = torch.sign(x) 2025-05-07T20:31:48.7882853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.7883174Z x = x_sign * x_clamp 2025-05-07T20:31:48.7883428Z x0 = x[:, :D] 2025-05-07T20:31:48.7883646Z x1 = x[:, D:] 2025-05-07T20:31:48.7883861Z 2025-05-07T20:31:48.7884056Z if contiguous: 2025-05-07T20:31:48.7884293Z x0 = x0.contiguous() 2025-05-07T20:31:48.7884564Z x1 = x1.contiguous() 2025-05-07T20:31:48.7884818Z 2025-05-07T20:31:48.7885014Z if scale_ub is not None: 2025-05-07T20:31:48.7885299Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.7885654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.7885966Z ) 2025-05-07T20:31:48.7886168Z else: 2025-05-07T20:31:48.7886389Z scale_ub_tensor = None 2025-05-07T20:31:48.7886801Z 2025-05-07T20:31:48.7887044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.7887369Z op = silu_mul_quant 2025-05-07T20:31:48.7887631Z if compiled: 2025-05-07T20:31:48.7887881Z op = torch.compile(op) 2025-05-07T20:31:48.7888184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.7888468Z 2025-05-07T20:31:48.7888661Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.7888834Z 2025-05-07T20:31:48.7888939Z moe/activation_test.py:117: 2025-05-07T20:31:48.7889241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.7889574Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.7889865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.7890560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.7891258Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.7891810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.7892497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.7893166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.7893701Z kernel = self.compile( 2025-05-07T20:31:48.7894299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.7894958Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.7895368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.7895601Z 2025-05-07T20:31:48.7895811Z self = 2025-05-07T20:31:48.7896911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.7898363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd696bc10>} 2025-05-07T20:31:48.7899702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.7900721Z context = 2025-05-07T20:31:48.7901013Z 2025-05-07T20:31:48.7901245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.7901785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.7902265Z module_map=module_map) 2025-05-07T20:31:48.7902642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.7903004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.7903274Z E ^ 2025-05-07T20:31:48.7903738Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.7904200Z 2025-05-07T20:31:48.7904617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.7905140Z 2025-05-07T20:31:48.7905247Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.7905670Z self=, 2025-05-07T20:31:48.7906082Z T=2048, 2025-05-07T20:31:48.7906272Z D=7168, 2025-05-07T20:31:48.7906473Z scale_ub=1200.0, 2025-05-07T20:31:48.7906710Z contiguous=True, 2025-05-07T20:31:48.7906937Z compiled=False, 2025-05-07T20:31:48.7907162Z ) 2025-05-07T20:31:48.7907576Z self = 2025-05-07T20:31:48.7908076Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.7908361Z 2025-05-07T20:31:48.7908445Z @given( 2025-05-07T20:31:48.7908685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.7909000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.7909321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.7909662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.7910000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.7910293Z ) 2025-05-07T20:31:48.7910651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.7911102Z def test_silu_mul_quant( 2025-05-07T20:31:48.7911348Z self, 2025-05-07T20:31:48.7911552Z T: int, 2025-05-07T20:31:48.7911755Z D: int, 2025-05-07T20:31:48.7911990Z scale_ub: Optional[float], 2025-05-07T20:31:48.7912272Z contiguous: bool, 2025-05-07T20:31:48.7912520Z compiled: bool, 2025-05-07T20:31:48.7912744Z ) -> None: 2025-05-07T20:31:48.7912977Z torch.manual_seed(2025) 2025-05-07T20:31:48.7913228Z 2025-05-07T20:31:48.7913501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.7915596Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
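The CompilationError traces above ("type fp8e4nv not supported in this architecture") come from Triton rejecting the FP8 E4M3 variant during kernel compilation: the 22.07 GiB device in these traces looks like an A10G (sm_86, an assumption from the memory size), and fp8e4nv codegen generally needs a newer NVIDIA architecture. A guard sketch, treating the (8, 9) floor as an assumption rather than something stated in the log:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # A10G reports capability (8, 6); fp8e4nv needs a newer arch.
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. at the top of test_silu_mul_quant:
    # if not supports_fp8e4nv():
    #     pytest.skip("fp8e4nv unsupported on this GPU architecture")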
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.7917737Z 2025-05-07T20:31:48.7917865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.7918090Z 2025-05-07T20:31:48.7918195Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.7918617Z self=, 2025-05-07T20:31:48.7919018Z T=1, 2025-05-07T20:31:48.7919212Z D=5120, 2025-05-07T20:31:48.7919412Z scale_ub=1200.0, 2025-05-07T20:31:48.7919635Z contiguous=True, 2025-05-07T20:31:48.7919866Z compiled=False, 2025-05-07T20:31:48.7920080Z ) 2025-05-07T20:31:48.8411177Z self = 2025-05-07T20:31:48.8411730Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.8411996Z 2025-05-07T20:31:48.8412077Z @given( 2025-05-07T20:31:48.8412317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8412638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8412971Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8413313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8413655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8413947Z ) 2025-05-07T20:31:48.8414299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8414747Z def test_silu_mul_quant( 2025-05-07T20:31:48.8414997Z self, 2025-05-07T20:31:48.8415193Z T: int, 2025-05-07T20:31:48.8415398Z D: int, 2025-05-07T20:31:48.8415628Z scale_ub: Optional[float], 2025-05-07T20:31:48.8415902Z contiguous: bool, 2025-05-07T20:31:48.8416149Z compiled: bool, 2025-05-07T20:31:48.8416384Z ) -> None: 2025-05-07T20:31:48.8416604Z torch.manual_seed(2025) 2025-05-07T20:31:48.8416859Z 2025-05-07T20:31:48.8417139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8417490Z 2025-05-07T20:31:48.8417917Z x_sign = torch.sign(x) 2025-05-07T20:31:48.8418231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.8418553Z x = x_sign * x_clamp 2025-05-07T20:31:48.8418801Z x0 = x[:, :D] 2025-05-07T20:31:48.8419024Z x1 = x[:, D:] 2025-05-07T20:31:48.8419235Z 2025-05-07T20:31:48.8419422Z if contiguous: 2025-05-07T20:31:48.8419660Z x0 = x0.contiguous() 2025-05-07T20:31:48.8419928Z x1 = x1.contiguous() 2025-05-07T20:31:48.8420173Z 2025-05-07T20:31:48.8420370Z if scale_ub is not None: 2025-05-07T20:31:48.8420654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.8420991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.8421413Z ) 2025-05-07T20:31:48.8429619Z else: 2025-05-07T20:31:48.8429892Z scale_ub_tensor = None 2025-05-07T20:31:48.8430156Z 2025-05-07T20:31:48.8430405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.8430745Z op = silu_mul_quant 2025-05-07T20:31:48.8431013Z if compiled: 2025-05-07T20:31:48.8431272Z op = torch.compile(op) 2025-05-07T20:31:48.8431573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.8431861Z 2025-05-07T20:31:48.8432065Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.8432234Z 2025-05-07T20:31:48.8432350Z moe/activation_test.py:117: 2025-05-07T20:31:48.8432654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.8433001Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.8433293Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.8433986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.8434735Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.8435283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.8436189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.8436865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.8437413Z kernel = self.compile( 2025-05-07T20:31:48.8437970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.8438637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.8439052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.8439295Z 2025-05-07T20:31:48.8439504Z self = 2025-05-07T20:31:48.8440888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.8442295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd66ef9d0>} 2025-05-07T20:31:48.8443654Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.8444716Z context = 2025-05-07T20:31:48.8445005Z 2025-05-07T20:31:48.8445187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.8445727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.8446191Z module_map=module_map) 2025-05-07T20:31:48.8446569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.8447060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.8447324Z E ^ 2025-05-07T20:31:48.8447796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.8448244Z 2025-05-07T20:31:48.8448666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.8449177Z 2025-05-07T20:31:48.8449295Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8449714Z self=, 2025-05-07T20:31:48.8450124Z T=2048, 2025-05-07T20:31:48.8450317Z D=5120, 2025-05-07T20:31:48.8450507Z scale_ub=None, 2025-05-07T20:31:48.8450734Z contiguous=True, 2025-05-07T20:31:48.8450969Z compiled=False, 2025-05-07T20:31:48.8451178Z ) 2025-05-07T20:31:48.8451503Z self = 2025-05-07T20:31:48.8452018Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.8452289Z 2025-05-07T20:31:48.8452374Z @given( 2025-05-07T20:31:48.8452604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8452923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8453238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8453570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8453908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8454237Z ) 2025-05-07T20:31:48.8454608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8455057Z def test_silu_mul_quant( 2025-05-07T20:31:48.8455307Z self, 2025-05-07T20:31:48.8455500Z T: int, 2025-05-07T20:31:48.8455696Z D: int, 2025-05-07T20:31:48.8455913Z scale_ub: Optional[float], 2025-05-07T20:31:48.8456187Z contiguous: bool, 2025-05-07T20:31:48.8456566Z compiled: bool, 2025-05-07T20:31:48.8456798Z ) -> None: 2025-05-07T20:31:48.8457027Z torch.manual_seed(2025) 2025-05-07T20:31:48.8457271Z 2025-05-07T20:31:48.8457556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8457912Z 2025-05-07T20:31:48.8458107Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.8460045Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.8461996Z 2025-05-07T20:31:48.8462129Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.8462351Z 2025-05-07T20:31:48.8462468Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8462887Z self=, 2025-05-07T20:31:48.8463290Z T=16384, 2025-05-07T20:31:48.8463491Z D=5120, 2025-05-07T20:31:48.8463689Z scale_ub=None, 2025-05-07T20:31:48.8463902Z contiguous=True, 2025-05-07T20:31:48.8464139Z compiled=False, 2025-05-07T20:31:48.8464376Z ) 2025-05-07T20:31:48.8464718Z self = 2025-05-07T20:31:48.8465228Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.8465504Z 2025-05-07T20:31:48.8465592Z @given( 2025-05-07T20:31:48.8465820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8466141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8466458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8466881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8467231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8467530Z ) 2025-05-07T20:31:48.8467886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8468333Z def test_silu_mul_quant( 2025-05-07T20:31:48.8468581Z self, 2025-05-07T20:31:48.8468783Z T: int, 2025-05-07T20:31:48.8468984Z D: int, 2025-05-07T20:31:48.8469209Z scale_ub: Optional[float], 2025-05-07T20:31:48.8469489Z contiguous: bool, 2025-05-07T20:31:48.8469727Z compiled: bool, 2025-05-07T20:31:48.8469958Z ) -> None: 2025-05-07T20:31:48.8470179Z torch.manual_seed(2025) 2025-05-07T20:31:48.8470421Z 2025-05-07T20:31:48.8470701Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8472735Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.8474585Z 2025-05-07T20:31:48.8474711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.8474928Z 2025-05-07T20:31:48.8475042Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8475454Z self=, 2025-05-07T20:31:48.8475865Z T=4096, 2025-05-07T20:31:48.8476058Z D=5120, 2025-05-07T20:31:48.8476249Z scale_ub=None, 2025-05-07T20:31:48.8476473Z contiguous=True, 2025-05-07T20:31:48.8476707Z compiled=False, 2025-05-07T20:31:48.8476998Z ) 2025-05-07T20:31:48.9503334Z self = 2025-05-07T20:31:48.9503921Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.9504324Z 2025-05-07T20:31:48.9504501Z @given( 2025-05-07T20:31:48.9504973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9505603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9506233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9506898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9507546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9508128Z ) 2025-05-07T20:31:48.9508829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9509706Z def test_silu_mul_quant( 2025-05-07T20:31:48.9510197Z self, 2025-05-07T20:31:48.9510597Z T: int, 2025-05-07T20:31:48.9511024Z D: int, 2025-05-07T20:31:48.9511467Z scale_ub: Optional[float], 2025-05-07T20:31:48.9512020Z contiguous: bool, 2025-05-07T20:31:48.9512496Z compiled: bool, 2025-05-07T20:31:48.9512949Z ) -> None: 2025-05-07T20:31:48.9513388Z torch.manual_seed(2025) 2025-05-07T20:31:48.9513746Z 2025-05-07T20:31:48.9514019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9516083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9518182Z 2025-05-07T20:31:48.9518312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9518535Z 2025-05-07T20:31:48.9518648Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9519061Z self=, 2025-05-07T20:31:48.9519471Z T=2048, 2025-05-07T20:31:48.9519671Z D=5120, 2025-05-07T20:31:48.9519871Z scale_ub=None, 2025-05-07T20:31:48.9520087Z contiguous=False, 2025-05-07T20:31:48.9520323Z compiled=False, 2025-05-07T20:31:48.9520538Z ) 2025-05-07T20:31:48.9520853Z self = 2025-05-07T20:31:48.9521352Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.9521626Z 2025-05-07T20:31:48.9521713Z @given( 2025-05-07T20:31:48.9521942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9522260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9522588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9522916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9523257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9523555Z ) 2025-05-07T20:31:48.9523939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9524406Z def test_silu_mul_quant( 2025-05-07T20:31:48.9524657Z self, 2025-05-07T20:31:48.9524859Z T: int, 2025-05-07T20:31:48.9525058Z D: int, 2025-05-07T20:31:48.9525288Z scale_ub: Optional[float], 2025-05-07T20:31:48.9525571Z contiguous: bool, 2025-05-07T20:31:48.9525817Z compiled: bool, 2025-05-07T20:31:48.9526048Z ) -> None: 2025-05-07T20:31:48.9526273Z torch.manual_seed(2025) 2025-05-07T20:31:48.9526518Z 2025-05-07T20:31:48.9526794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9528817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9530822Z 2025-05-07T20:31:48.9530943Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9531165Z 2025-05-07T20:31:48.9531277Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9531690Z self=, 2025-05-07T20:31:48.9532097Z T=4096, 2025-05-07T20:31:48.9532291Z D=7168, 2025-05-07T20:31:48.9532483Z scale_ub=None, 2025-05-07T20:31:48.9532714Z contiguous=True, 2025-05-07T20:31:48.9532943Z compiled=True, 2025-05-07T20:31:48.9533147Z ) 2025-05-07T20:31:48.9533475Z self = 2025-05-07T20:31:48.9533971Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.9534244Z 2025-05-07T20:31:48.9534331Z @given( 2025-05-07T20:31:48.9534563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9534888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9535203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9535533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9535870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9536167Z ) 2025-05-07T20:31:48.9536518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9536971Z def test_silu_mul_quant( 2025-05-07T20:31:48.9537223Z self, 2025-05-07T20:31:48.9537535Z T: int, 2025-05-07T20:31:48.9537746Z D: int, 2025-05-07T20:31:48.9537973Z scale_ub: Optional[float], 2025-05-07T20:31:48.9538246Z contiguous: bool, 2025-05-07T20:31:48.9538495Z compiled: bool, 2025-05-07T20:31:48.9538723Z ) -> None: 2025-05-07T20:31:48.9538945Z torch.manual_seed(2025) 2025-05-07T20:31:48.9539187Z 2025-05-07T20:31:48.9539464Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9541843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9543724Z 2025-05-07T20:31:48.9543852Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9544067Z 2025-05-07T20:31:48.9544172Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9544592Z self=, 2025-05-07T20:31:48.9544999Z T=2048, 2025-05-07T20:31:48.9545194Z D=5120, 2025-05-07T20:31:48.9545385Z scale_ub=1200.0, 2025-05-07T20:31:48.9545618Z contiguous=False, 2025-05-07T20:31:48.9545855Z compiled=False, 2025-05-07T20:31:48.9546060Z ) 2025-05-07T20:31:48.9546387Z self = 2025-05-07T20:31:48.9546889Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.9547167Z 2025-05-07T20:31:48.9547249Z @given( 2025-05-07T20:31:48.9547485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9547941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9548255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9548597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9548936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9549233Z ) 2025-05-07T20:31:48.9549585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9550032Z def test_silu_mul_quant( 2025-05-07T20:31:48.9550282Z self, 2025-05-07T20:31:48.9550478Z T: int, 2025-05-07T20:31:48.9550680Z D: int, 2025-05-07T20:31:48.9550906Z scale_ub: Optional[float], 2025-05-07T20:31:48.9551181Z contiguous: bool, 2025-05-07T20:31:48.9551426Z compiled: bool, 2025-05-07T20:31:48.9551658Z ) -> None: 2025-05-07T20:31:48.9551879Z torch.manual_seed(2025) 2025-05-07T20:31:48.9552129Z 2025-05-07T20:31:48.9552410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9554487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9556349Z 2025-05-07T20:31:48.9556479Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9556694Z 2025-05-07T20:31:48.9556797Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9557215Z self=, 2025-05-07T20:31:48.9557630Z T=4096, 2025-05-07T20:31:48.9557820Z D=7168, 2025-05-07T20:31:48.9558138Z scale_ub=1200.0, 2025-05-07T20:31:48.9558374Z contiguous=True, 2025-05-07T20:31:48.9558597Z compiled=False, 2025-05-07T20:31:48.9558814Z ) 2025-05-07T20:31:48.9559137Z self = 2025-05-07T20:31:48.9559631Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.9559912Z 2025-05-07T20:31:48.9559990Z @given( 2025-05-07T20:31:48.9560224Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9560541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9560848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9561182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9561520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9561804Z ) 2025-05-07T20:31:48.9562158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9562615Z def test_silu_mul_quant( 2025-05-07T20:31:48.9562857Z self, 2025-05-07T20:31:48.9563058Z T: int, 2025-05-07T20:31:48.9563262Z D: int, 2025-05-07T20:31:48.9563480Z scale_ub: Optional[float], 2025-05-07T20:31:48.9563759Z contiguous: bool, 2025-05-07T20:31:48.9564007Z compiled: bool, 2025-05-07T20:31:48.9564234Z ) -> None: 2025-05-07T20:31:48.9564450Z torch.manual_seed(2025) 2025-05-07T20:31:48.9564698Z 2025-05-07T20:31:48.9564973Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9566985Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9568943Z 2025-05-07T20:31:48.9569063Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9569285Z 2025-05-07T20:31:48.9569390Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9569807Z self=, 2025-05-07T20:31:48.9570214Z T=16384, 2025-05-07T20:31:48.9570406Z D=7168, 2025-05-07T20:31:48.9570601Z scale_ub=None, 2025-05-07T20:31:48.9570827Z contiguous=False, 2025-05-07T20:31:48.9571051Z compiled=True, 2025-05-07T20:31:48.9571261Z ) 2025-05-07T20:31:49.0869363Z self = 2025-05-07T20:31:49.0870080Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.0870500Z 2025-05-07T20:31:49.0870616Z @given( 2025-05-07T20:31:49.0870892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0871217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0871535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0871868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0872207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0872502Z ) 2025-05-07T20:31:49.0872853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0873305Z def test_silu_mul_quant( 2025-05-07T20:31:49.0873562Z self, 2025-05-07T20:31:49.0873776Z T: int, 2025-05-07T20:31:49.0873978Z D: int, 2025-05-07T20:31:49.0874209Z scale_ub: Optional[float], 2025-05-07T20:31:49.0874495Z contiguous: bool, 2025-05-07T20:31:49.0874739Z compiled: bool, 2025-05-07T20:31:49.0874971Z ) -> None: 2025-05-07T20:31:49.0875193Z torch.manual_seed(2025) 2025-05-07T20:31:49.0875794Z 2025-05-07T20:31:49.0876082Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0878152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0880029Z 2025-05-07T20:31:49.0880157Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0880371Z 2025-05-07T20:31:49.0880484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0880897Z self=, 2025-05-07T20:31:49.0881315Z T=4096, 2025-05-07T20:31:49.0881511Z D=7168, 2025-05-07T20:31:49.0881701Z scale_ub=None, 2025-05-07T20:31:49.0881925Z contiguous=True, 2025-05-07T20:31:49.0882156Z compiled=False, 2025-05-07T20:31:49.0882365Z ) 2025-05-07T20:31:49.0882690Z self = 2025-05-07T20:31:49.0883194Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.0883465Z 2025-05-07T20:31:49.0883546Z @given( 2025-05-07T20:31:49.0883792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0884105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0884457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0884824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0885157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0885452Z ) 2025-05-07T20:31:49.0885817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0886420Z def test_silu_mul_quant( 2025-05-07T20:31:49.0886672Z self, 2025-05-07T20:31:49.0886881Z T: int, 2025-05-07T20:31:49.0887081Z D: int, 2025-05-07T20:31:49.0887314Z scale_ub: Optional[float], 2025-05-07T20:31:49.0887595Z contiguous: bool, 2025-05-07T20:31:49.0887847Z compiled: bool, 2025-05-07T20:31:49.0888073Z ) -> None: 2025-05-07T20:31:49.0888299Z torch.manual_seed(2025) 2025-05-07T20:31:49.0888552Z 2025-05-07T20:31:49.0888823Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0890847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0892692Z 2025-05-07T20:31:49.0892815Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0893032Z 2025-05-07T20:31:49.0893142Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0893553Z self=, 2025-05-07T20:31:49.0893961Z T=16384, 2025-05-07T20:31:49.0894164Z D=7168, 2025-05-07T20:31:49.0894361Z scale_ub=None, 2025-05-07T20:31:49.0894576Z contiguous=True, 2025-05-07T20:31:49.0894809Z compiled=False, 2025-05-07T20:31:49.0895019Z ) 2025-05-07T20:31:49.0895338Z self = 2025-05-07T20:31:49.0895840Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.0896204Z 2025-05-07T20:31:49.0896295Z @given( 2025-05-07T20:31:49.0896527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0896847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0897163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0897492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0897832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0898125Z ) 2025-05-07T20:31:49.0898479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0898923Z def test_silu_mul_quant( 2025-05-07T20:31:49.0899174Z self, 2025-05-07T20:31:49.0899378Z T: int, 2025-05-07T20:31:49.0899577Z D: int, 2025-05-07T20:31:49.0899803Z scale_ub: Optional[float], 2025-05-07T20:31:49.0900082Z contiguous: bool, 2025-05-07T20:31:49.0900323Z compiled: bool, 2025-05-07T20:31:49.0900554Z ) -> None: 2025-05-07T20:31:49.0900788Z torch.manual_seed(2025) 2025-05-07T20:31:49.0901035Z 2025-05-07T20:31:49.0901439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0903468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0905302Z 2025-05-07T20:31:49.0905423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0905638Z 2025-05-07T20:31:49.0905751Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0906295Z self=, 2025-05-07T20:31:49.0906712Z T=16384, 2025-05-07T20:31:49.0906913Z D=7168, 2025-05-07T20:31:49.0907106Z scale_ub=1200.0, 2025-05-07T20:31:49.0907338Z contiguous=True, 2025-05-07T20:31:49.0907568Z compiled=False, 2025-05-07T20:31:49.0907774Z ) 2025-05-07T20:31:49.0908099Z self = 2025-05-07T20:31:49.0908604Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.0908881Z 2025-05-07T20:31:49.0908974Z @given( 2025-05-07T20:31:49.0909204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0909527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0909841Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0910172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0910511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0910817Z ) 2025-05-07T20:31:49.0911167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0911617Z def test_silu_mul_quant( 2025-05-07T20:31:49.0911869Z self, 2025-05-07T20:31:49.0912065Z T: int, 2025-05-07T20:31:49.0912271Z D: int, 2025-05-07T20:31:49.0912497Z scale_ub: Optional[float], 2025-05-07T20:31:49.0912776Z contiguous: bool, 2025-05-07T20:31:49.0913021Z compiled: bool, 2025-05-07T20:31:49.0913253Z ) -> None: 2025-05-07T20:31:49.0921481Z torch.manual_seed(2025) 2025-05-07T20:31:49.0921787Z 2025-05-07T20:31:49.0922082Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0924238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0926161Z 2025-05-07T20:31:49.0926295Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0926512Z 2025-05-07T20:31:49.0926619Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0927047Z self=, 2025-05-07T20:31:49.0927458Z T=128, 2025-05-07T20:31:49.0927648Z D=5120, 2025-05-07T20:31:49.0927850Z scale_ub=1200.0, 2025-05-07T20:31:49.0928087Z contiguous=False, 2025-05-07T20:31:49.0928315Z compiled=False, 2025-05-07T20:31:49.0928532Z ) 2025-05-07T20:31:49.4798676Z self = 2025-05-07T20:31:49.4799304Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4799587Z 2025-05-07T20:31:49.4799691Z @given( 2025-05-07T20:31:49.4799935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4800272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4800603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4800948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4801304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4801608Z ) 2025-05-07T20:31:49.4801976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4802426Z def test_silu_mul_quant( 2025-05-07T20:31:49.4802686Z self, 2025-05-07T20:31:49.4802900Z T: int, 2025-05-07T20:31:49.4803107Z D: int, 2025-05-07T20:31:49.4803342Z scale_ub: Optional[float], 2025-05-07T20:31:49.4803631Z contiguous: bool, 2025-05-07T20:31:49.4804265Z compiled: bool, 2025-05-07T20:31:49.4804539Z ) -> None: 2025-05-07T20:31:49.4804797Z torch.manual_seed(2025) 2025-05-07T20:31:49.4805050Z 2025-05-07T20:31:49.4805340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4805699Z 2025-05-07T20:31:49.4805902Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4806217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4806548Z x = x_sign * x_clamp 2025-05-07T20:31:49.4806801Z x0 = x[:, :D] 2025-05-07T20:31:49.4807040Z x1 = x[:, D:] 2025-05-07T20:31:49.4807266Z 2025-05-07T20:31:49.4807470Z if contiguous: 2025-05-07T20:31:49.4807714Z x0 = x0.contiguous() 2025-05-07T20:31:49.4807995Z x1 = x1.contiguous() 2025-05-07T20:31:49.4808259Z 2025-05-07T20:31:49.4808459Z if scale_ub is not None: 2025-05-07T20:31:49.4808754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4809122Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4809446Z ) 2025-05-07T20:31:49.4809660Z else: 2025-05-07T20:31:49.4809889Z scale_ub_tensor = None 2025-05-07T20:31:49.4810150Z 2025-05-07T20:31:49.4810406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4810740Z op = silu_mul_quant 2025-05-07T20:31:49.4811001Z if compiled: 2025-05-07T20:31:49.4811269Z op = torch.compile(op) 2025-05-07T20:31:49.4811591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4811874Z 2025-05-07T20:31:49.4812084Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4812265Z 2025-05-07T20:31:49.4812375Z moe/activation_test.py:117: 2025-05-07T20:31:49.4812690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4813031Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4813330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4814229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4814943Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4815505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4816206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4816885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4817428Z kernel = self.compile( 2025-05-07T20:31:49.4817984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4818652Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4819060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4819324Z 2025-05-07T20:31:49.4819539Z self = 2025-05-07T20:31:49.4820632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4822156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd65c6670>} 2025-05-07T20:31:49.4823519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4824549Z context = 2025-05-07T20:31:49.4824860Z 2025-05-07T20:31:49.4825066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4825694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4826171Z module_map=module_map) 2025-05-07T20:31:49.4826548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4826913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4827190Z E ^ 2025-05-07T20:31:49.4827655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4828111Z 2025-05-07T20:31:49.4828532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4829057Z 2025-05-07T20:31:49.4829166Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4829594Z self=, 2025-05-07T20:31:49.4830007Z T=2048, 2025-05-07T20:31:49.4830216Z D=7168, 2025-05-07T20:31:49.4830427Z scale_ub=None, 2025-05-07T20:31:49.4830652Z contiguous=False, 2025-05-07T20:31:49.4830898Z compiled=False, 2025-05-07T20:31:49.4831122Z ) 2025-05-07T20:31:49.4831446Z self = 2025-05-07T20:31:49.4831963Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.4832252Z 2025-05-07T20:31:49.4832334Z @given( 2025-05-07T20:31:49.4832581Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4832901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4833227Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4833573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4833914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4834214Z ) 2025-05-07T20:31:49.4834643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4835116Z def test_silu_mul_quant( 2025-05-07T20:31:49.4835373Z self, 2025-05-07T20:31:49.4835569Z T: int, 2025-05-07T20:31:49.4835776Z D: int, 2025-05-07T20:31:49.4836005Z scale_ub: Optional[float], 2025-05-07T20:31:49.4836288Z contiguous: bool, 2025-05-07T20:31:49.4836532Z compiled: bool, 2025-05-07T20:31:49.4836764Z ) -> None: 2025-05-07T20:31:49.4836990Z torch.manual_seed(2025) 2025-05-07T20:31:49.4837238Z 2025-05-07T20:31:49.4837519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4839568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
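[Annotation: the CompilationError above is a different failure mode from the OOMs. Triton refuses to emit fp8e4nv (E4M3) code for this GPU and lists only ('fp8e4b15', 'fp8e5') as supported, which is consistent with a pre-Ada device: the g5 runner's A10G reports compute capability (8, 6), while fp8e4nv generally requires (8, 9) or newer. Below is a minimal sketch of a capability gate for such tests; this is a hedged suggestion, not the repository's actual skip logic:]

    import unittest
    import torch

    def has_native_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) generally needs compute capability >= (8, 9),
        # i.e. Ada or Hopper; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_native_fp8e4nv(), "GPU lacks fp8e4nv (E4M3) support")
    class ActivationTests(unittest.TestCase):
        ...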
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4841754Z 2025-05-07T20:31:49.4841890Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.4842111Z 2025-05-07T20:31:49.4842217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4842641Z self=, 2025-05-07T20:31:49.4843050Z T=128, 2025-05-07T20:31:49.4843245Z D=7168, 2025-05-07T20:31:49.4843440Z scale_ub=1200.0, 2025-05-07T20:31:49.4843682Z contiguous=True, 2025-05-07T20:31:49.4843930Z compiled=True, 2025-05-07T20:31:49.4844167Z ) 2025-05-07T20:31:49.5299416Z self = 2025-05-07T20:31:49.5299986Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5300348Z 2025-05-07T20:31:49.5300714Z @given( 2025-05-07T20:31:49.5300959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5301377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5301690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5302029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5302365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5302657Z ) 2025-05-07T20:31:49.5303016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5303469Z def test_silu_mul_quant( 2025-05-07T20:31:49.5303717Z self, 2025-05-07T20:31:49.5303920Z T: int, 2025-05-07T20:31:49.5304120Z D: int, 2025-05-07T20:31:49.5304350Z scale_ub: Optional[float], 2025-05-07T20:31:49.5304630Z contiguous: bool, 2025-05-07T20:31:49.5304873Z compiled: bool, 2025-05-07T20:31:49.5305105Z ) -> None: 2025-05-07T20:31:49.5305331Z torch.manual_seed(2025) 2025-05-07T20:31:49.5305587Z 2025-05-07T20:31:49.5305866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5306220Z 2025-05-07T20:31:49.5306414Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5306716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5307038Z x = x_sign * x_clamp 2025-05-07T20:31:49.5307290Z x0 = x[:, :D] 2025-05-07T20:31:49.5307506Z x1 = x[:, D:] 2025-05-07T20:31:49.5307722Z 2025-05-07T20:31:49.5307917Z if contiguous: 2025-05-07T20:31:49.5308153Z x0 = x0.contiguous() 2025-05-07T20:31:49.5308422Z x1 = x1.contiguous() 2025-05-07T20:31:49.5308674Z 2025-05-07T20:31:49.5308870Z if scale_ub is not None: 2025-05-07T20:31:49.5309155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5309499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5309815Z ) 2025-05-07T20:31:49.5310170Z else: 2025-05-07T20:31:49.5310397Z scale_ub_tensor = None 2025-05-07T20:31:49.5310655Z 2025-05-07T20:31:49.5310894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5311218Z op = silu_mul_quant 2025-05-07T20:31:49.5311478Z if compiled: 2025-05-07T20:31:49.5311740Z op = torch.compile(op) 2025-05-07T20:31:49.5312049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5312338Z 2025-05-07T20:31:49.5312533Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5312711Z 2025-05-07T20:31:49.5312819Z moe/activation_test.py:117: 2025-05-07T20:31:49.5313130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5313466Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5313761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5314337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5314906Z return fn(*args, **kwargs) 2025-05-07T20:31:49.5315571Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5316261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5316810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5317491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5318157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5318694Z kernel = self.compile( 2025-05-07T20:31:49.5319243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5319898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5320309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5320638Z 2025-05-07T20:31:49.5320855Z self = 2025-05-07T20:31:49.5321930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5323320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd659b5e0>} 2025-05-07T20:31:49.5324721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5325742Z context = 2025-05-07T20:31:49.5326042Z 2025-05-07T20:31:49.5326221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5326748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5327220Z module_map=module_map) 2025-05-07T20:31:49.5327597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5327950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5328220Z E ^ 2025-05-07T20:31:49.5328685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5329135Z 2025-05-07T20:31:49.5329558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5330068Z 2025-05-07T20:31:49.5330175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5330685Z self=, 2025-05-07T20:31:49.5331104Z T=128, 2025-05-07T20:31:49.5331295Z D=7168, 2025-05-07T20:31:49.5331505Z scale_ub=1200.0, 2025-05-07T20:31:49.5331765Z contiguous=True, 2025-05-07T20:31:49.5332000Z compiled=False, 2025-05-07T20:31:49.5332215Z ) 2025-05-07T20:31:49.5332542Z self = 2025-05-07T20:31:49.5333041Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5333314Z 2025-05-07T20:31:49.5333404Z @given( 2025-05-07T20:31:49.5333635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5333958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5334280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5334614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5334955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5335252Z ) 2025-05-07T20:31:49.5335615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5336077Z def test_silu_mul_quant( 2025-05-07T20:31:49.5336329Z self, 2025-05-07T20:31:49.5336534Z T: int, 2025-05-07T20:31:49.5336733Z D: int, 2025-05-07T20:31:49.5336958Z scale_ub: Optional[float], 2025-05-07T20:31:49.5337236Z contiguous: bool, 2025-05-07T20:31:49.5337477Z compiled: bool, 2025-05-07T20:31:49.5337708Z ) -> None: 2025-05-07T20:31:49.5337932Z torch.manual_seed(2025) 2025-05-07T20:31:49.5338181Z 2025-05-07T20:31:49.5338462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5338814Z 2025-05-07T20:31:49.5339008Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5339309Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5341780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
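[Annotation: for orientation, the op under test (silu_mul_quant, lowered to the _fbgemm_silu_mul_quant Triton kernel in the tracebacks) fuses SiLU(x0) * x1 with row-wise fp8 quantization, returning the quantized tensor and one scale per row. A plain-PyTorch sketch of that math follows, assuming a build that provides torch.float8_e4m3fn; the scale convention (y ~= y_fp8.to(torch.float32) * y_scale[:, None]) is inferred from the test's dequantization step, not from the kernel itself:]

    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, as in the test's reference path.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row so each row fits the E4M3 range.
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)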
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5343823Z 2025-05-07T20:31:49.5343948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.5344169Z 2025-05-07T20:31:49.5344281Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5344694Z self=, 2025-05-07T20:31:49.5345105Z T=128, 2025-05-07T20:31:49.5345300Z D=5120, 2025-05-07T20:31:49.5345495Z scale_ub=1200.0, 2025-05-07T20:31:49.5345726Z contiguous=True, 2025-05-07T20:31:49.5345954Z compiled=True, 2025-05-07T20:31:49.5346160Z ) 2025-05-07T20:31:49.5346495Z self = 2025-05-07T20:31:49.5346995Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5347266Z 2025-05-07T20:31:49.5347352Z @given( 2025-05-07T20:31:49.5347583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5347908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5348225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5348557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5348894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5349190Z ) 2025-05-07T20:31:49.5349545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5349994Z def test_silu_mul_quant( 2025-05-07T20:31:49.5350246Z self, 2025-05-07T20:31:49.5350445Z T: int, 2025-05-07T20:31:49.5350650Z D: int, 2025-05-07T20:31:49.5351036Z scale_ub: Optional[float], 2025-05-07T20:31:49.5351316Z contiguous: bool, 2025-05-07T20:31:49.5351565Z compiled: bool, 2025-05-07T20:31:49.5351797Z ) -> None: 2025-05-07T20:31:49.5352026Z torch.manual_seed(2025) 2025-05-07T20:31:49.5352271Z 2025-05-07T20:31:49.5352549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5352900Z 2025-05-07T20:31:49.5353094Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5355067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
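[Annotation: each of these OOM messages suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but the setting is read when CUDA is first initialized; exporting it after the fact does nothing. A minimal sketch (equivalently, export the variable in the job's environment before pytest starts):]

    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402 -- imported only after the allocator is configured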
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5356901Z 2025-05-07T20:31:49.5357021Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5357236Z 2025-05-07T20:31:49.5357349Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5357763Z self=, 2025-05-07T20:31:49.5358171Z T=128, 2025-05-07T20:31:49.5358364Z D=7168, 2025-05-07T20:31:49.5358561Z scale_ub=None, 2025-05-07T20:31:49.5358774Z contiguous=True, 2025-05-07T20:31:49.5359003Z compiled=True, 2025-05-07T20:31:49.5359211Z ) 2025-05-07T20:31:49.8471064Z self = 2025-05-07T20:31:49.8471638Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.8471917Z 2025-05-07T20:31:49.8472019Z @given( 2025-05-07T20:31:49.8472275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8473073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8473394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8473742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8474125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8474436Z ) 2025-05-07T20:31:49.8474804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8475263Z def test_silu_mul_quant( 2025-05-07T20:31:49.8475524Z self, 2025-05-07T20:31:49.8475725Z T: int, 2025-05-07T20:31:49.8475938Z D: int, 2025-05-07T20:31:49.8476170Z scale_ub: Optional[float], 2025-05-07T20:31:49.8476450Z contiguous: bool, 2025-05-07T20:31:49.8476705Z compiled: bool, 2025-05-07T20:31:49.8476952Z ) -> None: 2025-05-07T20:31:49.8477179Z torch.manual_seed(2025) 2025-05-07T20:31:49.8477437Z 2025-05-07T20:31:49.8477727Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8479777Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
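[Annotation: on the test's contiguous parameter: the slices x0 = x[:, :D] and x1 = x[:, D:] in the listings above are views whose row stride is 2*D, so they are non-contiguous until .contiguous() copies them. A small illustration:]

    import torch

    D = 8
    x = torch.randn(4, 2 * D)               # stand-in for the [T, 2*D] activations
    x0, x1 = x[:, :D], x[:, D:]             # column slices share x's storage
    print(x0.is_contiguous())               # False: row stride is 2*D, not D
    print(x0.contiguous().is_contiguous())  # True, at the cost of a copy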
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8481663Z 2025-05-07T20:31:49.8481789Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.8482015Z 2025-05-07T20:31:49.8542604Z FAILED 2025-05-07T20:31:49.8543024Z 2025-05-07T20:31:49.8543561Z =================================== FAILURES =================================== 2025-05-07T20:31:49.8544070Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:49.8544847Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:49.8545555Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:49.8546116Z | yield 2025-05-07T20:31:49.8546564Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:31:49.8547083Z | self._callTestMethod(testMethod) 2025-05-07T20:31:49.8547656Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:31:49.8548211Z | method() 2025-05-07T20:31:49.8548874Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:49.8549612Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8550273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:49.8550927Z | raise the_error_hypothesis_found 2025-05-07T20:31:49.8551432Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:49.8551950Z +-+---------------- 1 ---------------- 2025-05-07T20:31:49.8552257Z | Traceback (most recent call last): 2025-05-07T20:31:49.8552988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8553780Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8555893Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
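[Annotation: by this point the allocator reports 21.77 GiB allocated with only 8.44 MiB free, so even a 20 MiB request fails; memory accumulated across the earlier Hypothesis examples rather than any single allocation being oversized. When triaging locally, PyTorch's allocator introspection is the quickest check (standard torch.cuda API, shown as a sketch to run inside the failing process):]

    import torch

    print(torch.cuda.memory_allocated() / 2**30)  # GiB currently allocated by tensors
    print(torch.cuda.memory_reserved() / 2**30)   # GiB held by the caching allocator
    print(torch.cuda.memory_summary())            # full per-pool breakdown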
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8558173Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8558635Z | self=, 2025-05-07T20:31:49.8559059Z | T=128, 2025-05-07T20:31:49.8559273Z | D=7168, 2025-05-07T20:31:49.8559495Z | scale_ub=1200.0, 2025-05-07T20:31:49.8559772Z | contiguous=True, 2025-05-07T20:31:49.8560033Z | compiled=False, 2025-05-07T20:31:49.8560265Z | ) 2025-05-07T20:31:49.8560458Z | 2025-05-07T20:31:49.8561002Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:49.8561636Z +---------------- 2 ---------------- 2025-05-07T20:31:49.8561977Z | Traceback (most recent call last): 2025-05-07T20:31:49.8562714Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8576253Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8578298Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8580395Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8580858Z | self=, 2025-05-07T20:31:49.8581381Z | T=128, 2025-05-07T20:31:49.8581599Z | D=7168, 2025-05-07T20:31:49.8581836Z | scale_ub=None, 2025-05-07T20:31:49.8582095Z | contiguous=True, 2025-05-07T20:31:49.8582348Z | compiled=True, 2025-05-07T20:31:49.8582587Z | ) 2025-05-07T20:31:49.8582783Z | 2025-05-07T20:31:49.8583316Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8583954Z +---------------- 3 ---------------- 2025-05-07T20:31:49.8584265Z | Traceback (most recent call last): 2025-05-07T20:31:49.8584988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8585786Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8588363Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
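[Annotation: Hypothesis prints an exact recipe for replaying each falsifying example. A sketch of where the decorator goes, using the blob from failure 1 above and the strategies copied from the listing; the blob is pinned to Hypothesis 6.131.14 and must match the strategies exactly, the module's max_examples=_MAX_SAMPLES setting is omitted here, and the decorator should be removed once the bug is fixed:]

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # replays the failure-1 example
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body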
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8590366Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8590875Z | self=, 2025-05-07T20:31:49.8591294Z | T=128, 2025-05-07T20:31:49.8591504Z | D=5120, 2025-05-07T20:31:49.8591732Z | scale_ub=1200.0, 2025-05-07T20:31:49.8592081Z | contiguous=True, 2025-05-07T20:31:49.8592327Z | compiled=True, 2025-05-07T20:31:49.8592571Z | ) 2025-05-07T20:31:49.8592765Z | 2025-05-07T20:31:49.8593288Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8593904Z +---------------- 4 ---------------- 2025-05-07T20:31:49.8594208Z | Traceback (most recent call last): 2025-05-07T20:31:49.8594938Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:49.8595654Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8596324Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:49.8597038Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8597899Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:49.8598761Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8599409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:49.8600159Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8600920Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:49.8601734Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8602567Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:49.8603488Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8604272Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:49.8605020Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8605711Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:49.8606325Z | fn() 2025-05-07T20:31:49.8606910Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:49.8607592Z | self.fn.run( 2025-05-07T20:31:49.8608155Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:49.8608753Z | kernel = self.compile( 2025-05-07T20:31:49.8609396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:49.8610162Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8610919Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.8611725Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8612252Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8612624Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8612922Z | ^ 2025-05-07T20:31:49.8613403Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8614051Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8614501Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:49.8615123Z | self=, 2025-05-07T20:31:49.8615573Z | T=1, # or any other generated value 2025-05-07T20:31:49.8615906Z | D=5120, # or any other generated value 2025-05-07T20:31:49.8616272Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:49.8616643Z | contiguous=True, # or any other generated value 2025-05-07T20:31:49.8617022Z | compiled=True, # or any other generated value 2025-05-07T20:31:49.8617340Z | ) 2025-05-07T20:31:49.8617522Z | 2025-05-07T20:31:49.8618056Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8618673Z +------------------------------------ 2025-05-07T20:31:49.8619042Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:49.8619434Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8619876Z self=, 2025-05-07T20:31:49.8620283Z T=1, 2025-05-07T20:31:49.8620469Z D=5120, 2025-05-07T20:31:49.8620673Z scale_ub=None, 2025-05-07T20:31:49.8620896Z contiguous=True, 2025-05-07T20:31:49.8621188Z compiled=True, 2025-05-07T20:31:49.8621405Z ) 2025-05-07T20:31:49.8621729Z self = 2025-05-07T20:31:49.8622229Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.8622500Z 2025-05-07T20:31:49.8622580Z @given( 2025-05-07T20:31:49.8622835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8623159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8623464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8623803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8624232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8624530Z ) 2025-05-07T20:31:49.8624888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8625342Z def test_silu_mul_quant( 2025-05-07T20:31:49.8625586Z self, 2025-05-07T20:31:49.8625790Z T: int, 2025-05-07T20:31:49.8626000Z D: int, 2025-05-07T20:31:49.8626221Z scale_ub: Optional[float], 2025-05-07T20:31:49.8626503Z contiguous: bool, 2025-05-07T20:31:49.8626753Z compiled: bool, 2025-05-07T20:31:49.8627021Z ) -> None: 2025-05-07T20:31:49.8627335Z torch.manual_seed(2025) 2025-05-07T20:31:49.8627697Z 2025-05-07T20:31:49.8628090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8628583Z 2025-05-07T20:31:49.8628872Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8629302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8629758Z x = x_sign * x_clamp 2025-05-07T20:31:49.8630144Z x0 = x[:, :D] 2025-05-07T20:31:49.8630472Z x1 = x[:, D:] 2025-05-07T20:31:49.8630780Z 2025-05-07T20:31:49.8631064Z if contiguous: 2025-05-07T20:31:49.8631418Z x0 = x0.contiguous() 
2025-05-07T20:31:49.8631803Z x1 = x1.contiguous() 2025-05-07T20:31:49.8632167Z 2025-05-07T20:31:49.8632458Z if scale_ub is not None: 2025-05-07T20:31:49.8632857Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8633347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8633774Z ) 2025-05-07T20:31:49.8634038Z else: 2025-05-07T20:31:49.8634339Z scale_ub_tensor = None 2025-05-07T20:31:49.8634705Z 2025-05-07T20:31:49.8635042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8635488Z op = silu_mul_quant 2025-05-07T20:31:49.8635862Z if compiled: 2025-05-07T20:31:49.8636242Z op = torch.compile(op) 2025-05-07T20:31:49.8636884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8637292Z 2025-05-07T20:31:49.8637578Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.8637998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.8638432Z 2025-05-07T20:31:49.8638786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8639277Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.8639702Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.8640382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.8640906Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8641355Z 2025-05-07T20:31:49.8641644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8641922Z 2025-05-07T20:31:49.8642075Z moe/activation_test.py:126: 2025-05-07T20:31:49.8642497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8643011Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.8643486Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8644607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.8645692Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8646460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8647437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8648411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.8649423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8650722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.8651781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8652813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.8653738Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8654599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.8655326Z fn() 2025-05-07T20:31:49.8656042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.8656870Z self.fn.run( 2025-05-07T20:31:49.8657530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8658273Z kernel = self.compile( 2025-05-07T20:31:49.8659038Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8659952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8660480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8660796Z 2025-05-07T20:31:49.8661065Z self = 2025-05-07T20:31:49.8662668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8664639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a5ba040>} 2025-05-07T20:31:49.8666557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8668145Z context = 2025-05-07T20:31:49.8668534Z 2025-05-07T20:31:49.8668760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8669481Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8670135Z module_map=module_map) 2025-05-07T20:31:49.8670637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8671130Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8671517Z E ^ 2025-05-07T20:31:49.8672158Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8672770Z 2025-05-07T20:31:49.8673347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8674050Z 2025-05-07T20:31:49.8674188Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8674748Z self=, 2025-05-07T20:31:49.8675301Z T=2048, 2025-05-07T20:31:49.8675572Z D=5120, 2025-05-07T20:31:49.8675851Z scale_ub=1200.0, 2025-05-07T20:31:49.8676169Z contiguous=True, 2025-05-07T20:31:49.8676495Z compiled=False, 2025-05-07T20:31:49.8676800Z ) 2025-05-07T20:31:49.8677262Z self = 2025-05-07T20:31:49.8677943Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.8678327Z 2025-05-07T20:31:49.8678443Z @given( 2025-05-07T20:31:49.8678777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8679217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8679659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8680241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8680701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8681102Z ) 2025-05-07T20:31:49.8681572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8682137Z def test_silu_mul_quant( 2025-05-07T20:31:49.8682449Z self, 2025-05-07T20:31:49.8682702Z T: int, 2025-05-07T20:31:49.8682956Z D: int, 2025-05-07T20:31:49.8683254Z scale_ub: Optional[float], 2025-05-07T20:31:49.8683602Z contiguous: bool, 2025-05-07T20:31:49.8683911Z compiled: bool, 2025-05-07T20:31:49.8684212Z ) -> None: 2025-05-07T20:31:49.8684534Z torch.manual_seed(2025) 2025-05-07T20:31:49.8684897Z 2025-05-07T20:31:49.8685281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8685783Z 2025-05-07T20:31:49.8686072Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8686499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8686948Z x = x_sign * x_clamp 2025-05-07T20:31:49.8687308Z x0 = x[:, :D] 
2025-05-07T20:31:49.8687615Z x1 = x[:, D:] 2025-05-07T20:31:49.8687931Z 2025-05-07T20:31:49.8688202Z if contiguous: 2025-05-07T20:31:49.8688536Z x0 = x0.contiguous() 2025-05-07T20:31:49.8688919Z x1 = x1.contiguous() 2025-05-07T20:31:49.8689275Z 2025-05-07T20:31:49.8689560Z if scale_ub is not None: 2025-05-07T20:31:49.8689956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8690443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8690895Z ) 2025-05-07T20:31:49.8691184Z else: 2025-05-07T20:31:49.8691498Z scale_ub_tensor = None 2025-05-07T20:31:49.8691869Z 2025-05-07T20:31:49.8692194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8692652Z op = silu_mul_quant 2025-05-07T20:31:49.8693136Z if compiled: 2025-05-07T20:31:49.8693488Z op = torch.compile(op) 2025-05-07T20:31:49.8693921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8694326Z 2025-05-07T20:31:49.8694577Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.8694799Z 2025-05-07T20:31:49.8694928Z moe/activation_test.py:117: 2025-05-07T20:31:49.8695315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8695737Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.8696090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8696960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.8697831Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.8698542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8699459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8700384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8701078Z kernel = self.compile( 2025-05-07T20:31:49.8701844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8702665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8703182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8703478Z 2025-05-07T20:31:49.8703763Z self = 2025-05-07T20:31:49.8705154Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8706968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a60f9d0>} 2025-05-07T20:31:49.8708728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8710056Z context = 2025-05-07T20:31:49.8710441Z 2025-05-07T20:31:49.8710682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8711428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8712099Z module_map=module_map) 2025-05-07T20:31:49.8712598Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8713077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.8713449Z E ^ 2025-05-07T20:31:49.8714089Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8714723Z 2025-05-07T20:31:49.8715266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8715974Z 2025-05-07T20:31:49.8716125Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8716695Z self=, 2025-05-07T20:31:49.8717232Z T=2048, 2025-05-07T20:31:49.8717495Z D=5120, 2025-05-07T20:31:49.8717762Z scale_ub=1200.0, 2025-05-07T20:31:49.8718060Z contiguous=True, 2025-05-07T20:31:49.8718370Z compiled=True, 2025-05-07T20:31:49.8718649Z ) 2025-05-07T20:31:49.8719066Z self = 2025-05-07T20:31:49.8719741Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.8720194Z 2025-05-07T20:31:49.8720309Z @given( 2025-05-07T20:31:49.8720613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8721054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8721486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8721945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8722403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8722793Z ) 2025-05-07T20:31:49.8723263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8723853Z def test_silu_mul_quant( 2025-05-07T20:31:49.8724203Z self, 2025-05-07T20:31:49.8724476Z T: int, 2025-05-07T20:31:49.8724752Z D: int, 2025-05-07T20:31:49.8725066Z scale_ub: Optional[float], 2025-05-07T20:31:49.8725447Z contiguous: bool, 2025-05-07T20:31:49.8725782Z compiled: bool, 2025-05-07T20:31:49.8726141Z ) -> None: 2025-05-07T20:31:49.8726446Z torch.manual_seed(2025) 2025-05-07T20:31:49.8726774Z 2025-05-07T20:31:49.8727162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8727647Z 2025-05-07T20:31:49.8727904Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8728295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8728691Z x = x_sign * x_clamp 2025-05-07T20:31:49.8729039Z x0 = x[:, :D] 2025-05-07T20:31:49.8729352Z x1 = x[:, D:] 2025-05-07T20:31:49.8729652Z 2025-05-07T20:31:49.8729926Z if contiguous: 2025-05-07T20:31:49.8730229Z x0 = x0.contiguous() 2025-05-07T20:31:49.8730564Z x1 = x1.contiguous() 2025-05-07T20:31:49.8730915Z 2025-05-07T20:31:49.8731186Z if scale_ub is not None: 2025-05-07T20:31:49.8731574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8732048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8733099Z ) 2025-05-07T20:31:49.8733376Z else: 2025-05-07T20:31:49.8733669Z scale_ub_tensor = None 2025-05-07T20:31:49.8734017Z 2025-05-07T20:31:49.8734336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8734776Z op = silu_mul_quant 2025-05-07T20:31:49.8735139Z if compiled: 2025-05-07T20:31:49.8735474Z op = torch.compile(op) 2025-05-07T20:31:49.8735897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8736291Z 2025-05-07T20:31:49.8736561Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.8736970Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.8737363Z 2025-05-07T20:31:49.8737672Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8738129Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.8738528Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.8738954Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.8739444Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8739868Z 2025-05-07T20:31:49.8740451Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8740730Z 2025-05-07T20:31:49.8740868Z moe/activation_test.py:126: 2025-05-07T20:31:49.8741345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8741806Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.8742243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8743348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.8744387Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8745125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8746050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8747178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.8748170Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8749173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.8750208Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8751236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.8752111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8752971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.8753722Z fn() 2025-05-07T20:31:49.8754491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.8755277Z self.fn.run( 2025-05-07T20:31:49.8755909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8756614Z kernel = self.compile( 2025-05-07T20:31:49.8757323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8758202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8758751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8759074Z 2025-05-07T20:31:49.8759366Z self = 2025-05-07T20:31:49.8761819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8784050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f317a693a60>} 2025-05-07T20:31:49.8785931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8787347Z context = 2025-05-07T20:31:49.8787739Z 2025-05-07T20:31:49.8787982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8788699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8789364Z module_map=module_map) 2025-05-07T20:31:49.8789908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8790416Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8790765Z E ^ 2025-05-07T20:31:49.8791378Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8791987Z 2025-05-07T20:31:49.8792554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8793247Z 2025-05-07T20:31:49.8793407Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8793975Z self=, 2025-05-07T20:31:49.8794536Z T=16384, 2025-05-07T20:31:49.8794822Z D=7168, 2025-05-07T20:31:49.8795095Z scale_ub=1200.0, 2025-05-07T20:31:49.8795419Z contiguous=False, 2025-05-07T20:31:49.8795743Z compiled=False, 2025-05-07T20:31:49.8796039Z ) 2025-05-07T20:31:49.8796476Z self = 2025-05-07T20:31:49.8797394Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.8797769Z 2025-05-07T20:31:49.8797889Z @given( 2025-05-07T20:31:49.8798190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8798607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8799028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8799467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8799924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8800322Z ) 2025-05-07T20:31:49.8800819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8801450Z def test_silu_mul_quant( 2025-05-07T20:31:49.8801805Z self, 2025-05-07T20:31:49.8802086Z T: int, 2025-05-07T20:31:49.8802363Z D: int, 2025-05-07T20:31:49.8802680Z scale_ub: Optional[float], 2025-05-07T20:31:49.8803095Z contiguous: bool, 2025-05-07T20:31:49.8803420Z compiled: bool, 2025-05-07T20:31:49.8803732Z ) -> None: 2025-05-07T20:31:49.8804030Z torch.manual_seed(2025) 2025-05-07T20:31:49.8804323Z 2025-05-07T20:31:49.8804602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8804955Z 2025-05-07T20:31:49.8805150Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8805445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8805765Z x = x_sign * x_clamp 2025-05-07T20:31:49.8806006Z x0 = x[:, :D] 2025-05-07T20:31:49.8806227Z x1 = x[:, D:] 2025-05-07T20:31:49.8806439Z 2025-05-07T20:31:49.8806627Z if contiguous: 2025-05-07T20:31:49.8806864Z x0 = x0.contiguous() 2025-05-07T20:31:49.8807131Z x1 = x1.contiguous() 2025-05-07T20:31:49.8807375Z 2025-05-07T20:31:49.8807562Z if scale_ub is not None: 2025-05-07T20:31:49.8807945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8808296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8808606Z ) 2025-05-07T20:31:49.8808802Z else: 2025-05-07T20:31:49.8809017Z scale_ub_tensor = None 2025-05-07T20:31:49.8809264Z 2025-05-07T20:31:49.8809502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8809819Z op = silu_mul_quant 2025-05-07T20:31:49.8810069Z if compiled: 
2025-05-07T20:31:49.8810324Z op = torch.compile(op) 2025-05-07T20:31:49.8810624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8810898Z 2025-05-07T20:31:49.8811091Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.8811261Z 2025-05-07T20:31:49.8811370Z moe/activation_test.py:117: 2025-05-07T20:31:49.8811667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8812001Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.8812295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8813002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.8813689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.8814229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8814924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8815595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8816124Z kernel = self.compile( 2025-05-07T20:31:49.8816669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8817329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8817731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8818057Z 2025-05-07T20:31:49.8818269Z self = 2025-05-07T20:31:49.8819346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8820735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317ad33700>} 2025-05-07T20:31:49.8822231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8823255Z context = 2025-05-07T20:31:49.8823559Z 2025-05-07T20:31:49.8823731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8824257Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8824726Z module_map=module_map) 2025-05-07T20:31:49.8825097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8825448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.8825712Z E ^ 2025-05-07T20:31:49.8826176Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317b531ee0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
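Note on the failure above: "fp8e4nv" is Triton's name for the FP8 E4M3 format these kernels emit, and Triton only enables it on GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports capability 8.6, so only the 'fp8e4b15' and 'fp8e5' encodings are available and every example that reaches an FP8 kernel fails the same way. A minimal guard along these lines (the helper and class names are illustrative, not taken from the test file) would skip the sweep on unsupported devices:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv needs SM 8.9+; the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...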
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)
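For reference, both failing kernels back the same row-wise FP8 quantization the test exercises: _fbgemm_silu_mul_quant fuses SiLU(x0) * x1 with the quantization, while ref_fn computes the product in fp32 and hands it to triton_quantize_fp8_row. A rough pure-PyTorch sketch of that quantization follows; the exact scale handling (per-row max, optional scale_ub clamp, E4M3 max of 448) is an assumption about the kernels' behavior, not their actual implementation:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # assumed max finite magnitude of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then quantize each row to FP8 E4M3.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX  # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then matches the test's check: y_fp8.to(torch.float32) * scale[:, None].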
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (same CompilationError in _kernel_quantize_fp8_row)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (same CompilationError in _kernel_quantize_fp8_row)
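Since every tried example fails identically at kernel compile time, a standalone script is a quicker repro than replaying the Hypothesis sweep. Something like the following, with the import path read off the traceback above (the exact module spelling is an assumption about the installed fbgemm_gpu wheel), should raise the same CompilationError on this runner:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 1, 7168  # first failing example from this log
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError (fp8e4nv).
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)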
2025-05-07T20:31:49.9070960Z op = torch.compile(op) 2025-05-07T20:31:49.9071068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9071147Z 2025-05-07T20:31:49.9071237Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9071359Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9071435Z 2025-05-07T20:31:49.9071569Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9071675Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9071776Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9071913Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9072055Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9072129Z 2025-05-07T20:31:49.9072229Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.9072234Z 2025-05-07T20:31:49.9072336Z moe/activation_test.py:126: 2025-05-07T20:31:49.9072465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9072575Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9072711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9073266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9073372Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9073732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9074044Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9074417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9074670Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9075074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9075325Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9075694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9075864Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9076203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9076292Z fn() 2025-05-07T20:31:49.9076688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9076772Z self.fn.run( 2025-05-07T20:31:49.9077123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9077217Z kernel = self.compile( 2025-05-07T20:31:49.9077600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9077782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9077909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9077913Z 2025-05-07T20:31:49.9078120Z self = 2025-05-07T20:31:49.9078992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:49.9079504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd947f0d0>} 2025-05-07T20:31:49.9080243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9080436Z context = 2025-05-07T20:31:49.9080441Z 2025-05-07T20:31:49.9080612Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9080875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9080991Z module_map=module_map) 2025-05-07T20:31:49.9081157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9081260Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.9081345Z E ^ 2025-05-07T20:31:49.9081696Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9081700Z 2025-05-07T20:31:49.9082113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9082118Z 2025-05-07T20:31:49.9082230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9082451Z self=, 2025-05-07T20:31:49.9082535Z T=128, 2025-05-07T20:31:49.9082610Z D=5120, 2025-05-07T20:31:49.9082694Z scale_ub=None, 2025-05-07T20:31:49.9082786Z contiguous=True, 2025-05-07T20:31:49.9082869Z compiled=True, 2025-05-07T20:31:49.9082942Z ) 2025-05-07T20:31:49.9083251Z self = 2025-05-07T20:31:49.9083423Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9083427Z 2025-05-07T20:31:49.9083504Z @given( 2025-05-07T20:31:49.9083627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9083726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9083849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9083976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9084110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9084207Z ) 2025-05-07T20:31:49.9084460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9084555Z def test_silu_mul_quant( 2025-05-07T20:31:49.9084634Z self, 2025-05-07T20:31:49.9084711Z T: int, 2025-05-07T20:31:49.9084788Z D: int, 2025-05-07T20:31:49.9084907Z scale_ub: Optional[float], 2025-05-07T20:31:49.9084997Z contiguous: bool, 2025-05-07T20:31:49.9085081Z compiled: bool, 2025-05-07T20:31:49.9085164Z ) -> None: 2025-05-07T20:31:49.9085258Z torch.manual_seed(2025) 2025-05-07T20:31:49.9085334Z 2025-05-07T20:31:49.9085503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9085578Z 2025-05-07T20:31:49.9085676Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9085803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9085894Z x = x_sign * x_clamp 2025-05-07T20:31:49.9085977Z x0 = x[:, :D] 2025-05-07T20:31:49.9086055Z x1 = x[:, D:] 2025-05-07T20:31:49.9086129Z 2025-05-07T20:31:49.9086215Z if contiguous: 2025-05-07T20:31:49.9086311Z x0 = x0.contiguous() 2025-05-07T20:31:49.9086401Z x1 = x1.contiguous() 2025-05-07T20:31:49.9086482Z 2025-05-07T20:31:49.9086576Z if scale_ub is not None: 2025-05-07T20:31:49.9086771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9086911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9086986Z ) 2025-05-07T20:31:49.9087068Z else: 2025-05-07T20:31:49.9087162Z scale_ub_tensor = None 2025-05-07T20:31:49.9087234Z 2025-05-07T20:31:49.9087367Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:49.9087457Z op = silu_mul_quant 2025-05-07T20:31:49.9087543Z if compiled: 2025-05-07T20:31:49.9087649Z op = torch.compile(op) 2025-05-07T20:31:49.9087756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9087831Z 2025-05-07T20:31:49.9087925Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9088045Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9088122Z 2025-05-07T20:31:49.9088256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9088368Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9088471Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9088594Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9088734Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9088811Z 2025-05-07T20:31:49.9088910Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.9088914Z 2025-05-07T20:31:49.9089011Z moe/activation_test.py:126: 2025-05-07T20:31:49.9089151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9089259Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9089398Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9094879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9095004Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9095513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9095741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9096121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9096376Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9096777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9097031Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9097409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9097581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9097946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9098026Z fn() 2025-05-07T20:31:49.9098435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9098518Z self.fn.run( 2025-05-07T20:31:49.9098858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9098959Z kernel = self.compile( 2025-05-07T20:31:49.9099334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9099515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9099642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9099648Z 2025-05-07T20:31:49.9099853Z self = 2025-05-07T20:31:49.9100713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd90d25e0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd91a5940>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
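The ValueError itself names the problem: on this GPU, this Triton build only offers the fp8e4b15 and fp8e5 (E5M2) encodings, while both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv, Triton's NVIDIA E4M3 type. fp8e4nv lowering appears to require compute capability (8, 9) or newer (Ada/Hopper); pre-Ada parts such as the A100 at (8, 0) or the A10G at (8, 6) trip exactly this error. Below is a minimal guard sketch that would let the suite skip cleanly on such machines; the helper name, the (8, 9) threshold, and the placement are ours, not FBGEMM's:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    """Best-effort probe: can this device compile fp8e4nv (E4M3) Triton kernels?"""
    if not torch.cuda.is_available():
        return False
    # Assumption: E4M3 conversion is lowered only for SM >= 8.9 (Ada/Hopper);
    # an A10G reports (8, 6) and raises the CompilationError seen above.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: gate the whole test class rather than each example.
@unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...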
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): same test body, same failure; fn() returns, then ref_fn() reaches triton_quantize_fp8_row -> _kernel_quantize_fp8_row and dies with the identical fp8e4nv CompilationError.

The forward path fails the same way. Representative report:

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [same @given/@settings decorators and test body as above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd82f5ca0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
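Both failing kernels reduce to the same one-line trigger: any @triton.jit kernel that casts to tl.float8e4nv raises this CompilationError at compile time on an affected GPU. A minimal repro sketch (kernel and buffer names are ours; assumes a Triton/PyTorch pairing recent enough to expose tl.float8e4nv and torch.float8_e4m3fn):

import torch
import triton
import triton.language as tl


@triton.jit
def _probe_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The offending conversion: bf16 -> fp8e4nv (E4M3).
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On SM < 8.9 this should fail with the same "type fp8e4nv not supported"
# CompilationError; on Ada/Hopper it compiles and runs.
_probe_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)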
Hypothesis keeps sampling; every remaining example in this chunk fails the same way, with

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

raised while compiling one of the two kernels. Only the sampled parameters and the first failing call differ:

    T      D     scale_ub  contiguous  compiled  first failing call
    1      5120  None      False       True      ref_fn() -> _kernel_quantize_fp8_row
    1      5120  None      True        False     fn() -> _fbgemm_silu_mul_quant
    128    5120  None      False       True      fn() -> _fbgemm_silu_mul_quant
    128    7168  1200.0    False       False     fn() -> _fbgemm_silu_mul_quant
    128    5120  None      False       False     fn() -> _fbgemm_silu_mul_quant
    128    5120  1200.0    True        False     fn() -> _fbgemm_silu_mul_quant
    1      7168  1200.0    True        True      fn() -> _fbgemm_silu_mul_quant
    1      7168  1200.0    False       True      fn() -> _fbgemm_silu_mul_quant

(In the first row fn() itself returned and only the reference path failed; in the rest, fn() is the first call to compile the offending kernel. Each report repeats the test body and one of the two tracebacks shown above verbatim.)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9261810Z 2025-05-07T20:31:49.9262223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9262227Z 2025-05-07T20:31:49.9262327Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9262553Z self=, 2025-05-07T20:31:49.9262628Z T=1, 2025-05-07T20:31:49.9262703Z D=7168, 2025-05-07T20:31:49.9262786Z scale_ub=None, 2025-05-07T20:31:49.9262870Z contiguous=False, 2025-05-07T20:31:49.9262955Z compiled=True, 2025-05-07T20:31:49.9263028Z ) 2025-05-07T20:31:49.9263245Z self = 2025-05-07T20:31:49.9263409Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9263500Z 2025-05-07T20:31:49.9263576Z @given( 2025-05-07T20:31:49.9263694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9263797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9263910Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9264025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9264141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9264216Z ) 2025-05-07T20:31:49.9264466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9264558Z def test_silu_mul_quant( 2025-05-07T20:31:49.9264634Z self, 2025-05-07T20:31:49.9264712Z T: int, 2025-05-07T20:31:49.9264789Z D: int, 2025-05-07T20:31:49.9264888Z scale_ub: Optional[float], 2025-05-07T20:31:49.9264977Z contiguous: bool, 2025-05-07T20:31:49.9265063Z compiled: bool, 2025-05-07T20:31:49.9265139Z ) -> None: 2025-05-07T20:31:49.9265248Z torch.manual_seed(2025) 2025-05-07T20:31:49.9265320Z 2025-05-07T20:31:49.9265490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9265564Z 2025-05-07T20:31:49.9265654Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9265782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9265871Z x = x_sign * x_clamp 2025-05-07T20:31:49.9265951Z x0 = x[:, :D] 2025-05-07T20:31:49.9266036Z x1 = x[:, D:] 2025-05-07T20:31:49.9266107Z 2025-05-07T20:31:49.9266191Z if contiguous: 2025-05-07T20:31:49.9266287Z x0 = x0.contiguous() 2025-05-07T20:31:49.9266374Z x1 = x1.contiguous() 2025-05-07T20:31:49.9266448Z 2025-05-07T20:31:49.9266541Z if scale_ub is not None: 2025-05-07T20:31:49.9266645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9266779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9266941Z ) 2025-05-07T20:31:49.9267019Z else: 2025-05-07T20:31:49.9267111Z scale_ub_tensor = None 2025-05-07T20:31:49.9267187Z 2025-05-07T20:31:49.9267318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9267411Z op = silu_mul_quant 2025-05-07T20:31:49.9267496Z if compiled: 2025-05-07T20:31:49.9267594Z op = torch.compile(op) 2025-05-07T20:31:49.9267704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9267778Z 2025-05-07T20:31:49.9267867Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9267989Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9268063Z 2025-05-07T20:31:49.9268196Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9268303Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9268403Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9268534Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9268689Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9268766Z 2025-05-07T20:31:49.9268868Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:49.9268872Z 2025-05-07T20:31:49.9268969Z moe/activation_test.py:126: 2025-05-07T20:31:49.9269097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9269203Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9269335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9269889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9269992Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9270349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9270579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9271028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9271283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9271682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9271934Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9272315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9272482Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9272828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9272907Z fn() 2025-05-07T20:31:49.9273313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9273401Z self.fn.run( 2025-05-07T20:31:49.9273743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9273837Z kernel = self.compile( 2025-05-07T20:31:49.9274218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9274394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9274522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9274527Z 2025-05-07T20:31:49.9274742Z self = 2025-05-07T20:31:49.9275587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9276105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f2fd7721430>} 2025-05-07T20:31:49.9276842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9277035Z context = 2025-05-07T20:31:49.9277040Z 2025-05-07T20:31:49.9277205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9277464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9277573Z module_map=module_map) 2025-05-07T20:31:49.9277739Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9277848Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.9277927Z E ^ 2025-05-07T20:31:49.9278276Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9278280Z 2025-05-07T20:31:49.9278693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9278697Z 2025-05-07T20:31:49.9278798Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9279019Z self=, 2025-05-07T20:31:49.9279098Z T=1, 2025-05-07T20:31:49.9279173Z D=5120, 2025-05-07T20:31:49.9279254Z scale_ub=1200.0, 2025-05-07T20:31:49.9279344Z contiguous=False, 2025-05-07T20:31:49.9279425Z compiled=True, 2025-05-07T20:31:49.9279498Z ) 2025-05-07T20:31:49.9279718Z self = 2025-05-07T20:31:49.9279972Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9279976Z 2025-05-07T20:31:49.9280054Z @given( 2025-05-07T20:31:49.9280171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9280268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9280384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9280499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9280611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9280693Z ) 2025-05-07T20:31:49.9280938Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9281034Z def test_silu_mul_quant( 2025-05-07T20:31:49.9281109Z self, 2025-05-07T20:31:49.9281184Z T: int, 2025-05-07T20:31:49.9281263Z D: int, 2025-05-07T20:31:49.9281360Z scale_ub: Optional[float], 2025-05-07T20:31:49.9281451Z contiguous: bool, 2025-05-07T20:31:49.9281548Z compiled: bool, 2025-05-07T20:31:49.9281625Z ) -> None: 2025-05-07T20:31:49.9281718Z torch.manual_seed(2025) 2025-05-07T20:31:49.9281793Z 2025-05-07T20:31:49.9281960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9282032Z 2025-05-07T20:31:49.9282125Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9282249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9282340Z x = x_sign * x_clamp 2025-05-07T20:31:49.9282420Z x0 = x[:, :D] 2025-05-07T20:31:49.9282501Z x1 = x[:, D:] 2025-05-07T20:31:49.9282575Z 2025-05-07T20:31:49.9282657Z if contiguous: 2025-05-07T20:31:49.9282749Z x0 = x0.contiguous() 2025-05-07T20:31:49.9282843Z x1 = x1.contiguous() 2025-05-07T20:31:49.9282916Z 2025-05-07T20:31:49.9283007Z if scale_ub is not None: 2025-05-07T20:31:49.9283116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9283364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9283443Z ) 2025-05-07T20:31:49.9283523Z else: 2025-05-07T20:31:49.9283614Z scale_ub_tensor = None 2025-05-07T20:31:49.9283685Z 2025-05-07T20:31:49.9283819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9283908Z op = silu_mul_quant 2025-05-07T20:31:49.9283998Z if compiled: 
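Every retry above and below fails at the same point: Triton cannot lower the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on this GPU. Triton only permits fp8e4nv casts on NVIDIA devices of compute capability sm_89 (Ada) or newer; on older cards only the fp8e4b15 and fp8e5 encodings are available, which is exactly what the ValueError reports. A capability gate along the following lines would turn the architecture mismatch into a skip instead of a failure; this is a minimal sketch, and the helper name, class name, and skip message are illustrative rather than taken from the test file:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) lowering in Triton needs an NVIDIA GPU
    # with compute capability >= (8, 9), i.e. Ada (sm_89) or Hopper (sm_90).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # class name assumed for illustration
    @unittest.skipUnless(
        cuda_supports_fp8e4nv(),
        "Triton fp8e4nv needs sm_89+; this GPU only supports fp8e4b15/fp8e5",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # body as in the listing above

Hypothesis's retries cannot help here: the failure is a property of the device, not of the drawn example, so every example fails identically regardless of T, D, scale_ub, contiguous, or compiled.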
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[... test source as above; fails at fn() with the same _fbgemm_silu_mul_quant CompilationError ...]

Hypothesis keeps drawing new examples, and each one fails the same way: the Triton compile of _fbgemm_silu_mul_quant (reached from fn() at moe/activation_test.py:117) raises CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The examples tried:

Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
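For reference, the error reproduces outside the test suite with any Triton kernel that converts to fp8e4nv on such a device. A minimal standalone sketch, assuming Triton's tl.float8e4nv and PyTorch's torch.float8_e4m3fn are available in this environment; the kernel and variable names are invented for illustration:

import torch
import triton
import triton.language as tl


@triton.jit
def to_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # On a pre-sm_89 GPU this cast is what raises
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on sm_80/sm_86;
# compiles and runs on sm_89/sm_90.
to_fp8e4nv_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)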
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9394936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9395169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9395512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9395611Z kernel = self.compile( 2025-05-07T20:31:49.9395994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9396171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9396294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9396298Z 2025-05-07T20:31:49.9396505Z self = 2025-05-07T20:31:49.9397277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9397861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a40d0>} 2025-05-07T20:31:49.9398603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9398795Z context = 2025-05-07T20:31:49.9398800Z 2025-05-07T20:31:49.9398969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9399234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9399340Z module_map=module_map) 2025-05-07T20:31:49.9399507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9399615Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9399692Z E ^ 2025-05-07T20:31:49.9400051Z E ValueError("type fp8e4nv not supported in this architecture. 
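Every example dies in the same place: Triton's frontend rejects the fp8e4nv (FP8 E4M3) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is only available on GPUs with compute capability 8.9 or newer; on older architectures this Triton build exposes only the 'fp8e4b15' and 'fp8e5' encodings, exactly as the ValueError lists. A conventional way to keep such a suite green on older runners is to gate the test on device capability; the sketch below is illustrative (the helper and test class names are not from the FBGEMM test file):

    import unittest

    import torch


    def supports_fp8_e4m3() -> bool:
        # fp8e4nv (FP8 E4M3) needs SM 8.9+ (Ada/Hopper-class GPUs); earlier
        # parts only get Triton's fp8e4b15/fp8e5, matching the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class SiluMulQuantGuardExample(unittest.TestCase):
        @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # the real property-based body would run here

With this guard the run would report the test as skipped on a pre-SM-8.9 GPU instead of failing once per Hypothesis example.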
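For orientation, silu_mul_quant(x0, x1, scale_ub) returns a (y_fp8, y_scale) pair, i.e. SiLU(x0) * x1 quantized to FP8. A rough eager-mode equivalent is sketched below, assuming rowwise dynamic scaling with an optional upper bound on the scale numerator; the actual FBGEMM kernel's scaling granularity may differ:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, then quantize to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise scaling is an assumption of this sketch, not a spec.
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        y_scale = amax / FP8_MAX  # dequantization scale, one per row
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale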
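Since every drawn example fails identically, reproducing a single case outside Hypothesis is usually quicker when iterating on a fix. A minimal repro using the same seed and tensor construction as the test (import path taken from the traceback above):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 4096, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Raises CompilationError on pre-SM-8.9 GPUs, as in the log above.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], scale_ub_tensor)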
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9400056Z 2025-05-07T20:31:49.9400466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9400470Z 2025-05-07T20:31:49.9400574Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9400795Z self=, 2025-05-07T20:31:49.9400871Z T=4096, 2025-05-07T20:31:49.9400950Z D=5120, 2025-05-07T20:31:49.9401032Z scale_ub=1200.0, 2025-05-07T20:31:49.9401118Z contiguous=False, 2025-05-07T20:31:49.9401203Z compiled=True, 2025-05-07T20:31:49.9401274Z ) 2025-05-07T20:31:49.9401491Z self = 2025-05-07T20:31:49.9401744Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9401749Z 2025-05-07T20:31:49.9401826Z @given( 2025-05-07T20:31:49.9401947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9402045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9402161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9402279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9402391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9402464Z ) 2025-05-07T20:31:49.9402711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9402803Z def test_silu_mul_quant( 2025-05-07T20:31:49.9402878Z self, 2025-05-07T20:31:49.9402956Z T: int, 2025-05-07T20:31:49.9403031Z D: int, 2025-05-07T20:31:49.9403130Z scale_ub: Optional[float], 2025-05-07T20:31:49.9403222Z contiguous: bool, 2025-05-07T20:31:49.9403318Z compiled: bool, 2025-05-07T20:31:49.9403398Z ) -> None: 2025-05-07T20:31:49.9403492Z torch.manual_seed(2025) 2025-05-07T20:31:49.9403563Z 2025-05-07T20:31:49.9403735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9403810Z 2025-05-07T20:31:49.9403900Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9404026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9404114Z x = x_sign * x_clamp 2025-05-07T20:31:49.9404194Z x0 = x[:, :D] 2025-05-07T20:31:49.9404277Z x1 = x[:, D:] 2025-05-07T20:31:49.9404348Z 2025-05-07T20:31:49.9404430Z if contiguous: 2025-05-07T20:31:49.9404523Z x0 = x0.contiguous() 2025-05-07T20:31:49.9404611Z x1 = x1.contiguous() 2025-05-07T20:31:49.9404687Z 2025-05-07T20:31:49.9404777Z if scale_ub is not None: 2025-05-07T20:31:49.9404881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9405022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9405177Z ) 2025-05-07T20:31:49.9405254Z else: 2025-05-07T20:31:49.9405350Z scale_ub_tensor = None 2025-05-07T20:31:49.9405423Z 2025-05-07T20:31:49.9405550Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9405641Z op = silu_mul_quant 2025-05-07T20:31:49.9405726Z if compiled: 2025-05-07T20:31:49.9405825Z op = torch.compile(op) 2025-05-07T20:31:49.9405932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9406004Z 2025-05-07T20:31:49.9406096Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9406100Z 2025-05-07T20:31:49.9406196Z moe/activation_test.py:117: 2025-05-07T20:31:49.9406322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9406423Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9406522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9406901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9407002Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9407491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9407589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9407946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9408169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9408514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9408606Z kernel = self.compile( 2025-05-07T20:31:49.9408991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9409277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9409407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9409411Z 2025-05-07T20:31:49.9409618Z self = 2025-05-07T20:31:49.9410384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9410883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a4dc0>} 2025-05-07T20:31:49.9411622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9411822Z context = 2025-05-07T20:31:49.9411827Z 2025-05-07T20:31:49.9411993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9412261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9412371Z module_map=module_map) 2025-05-07T20:31:49.9412533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9412630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9412710Z E ^ 2025-05-07T20:31:49.9413065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9413070Z 2025-05-07T20:31:49.9413478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9413482Z 2025-05-07T20:31:49.9413665Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9413886Z self=, 2025-05-07T20:31:49.9413965Z T=2048, 2025-05-07T20:31:49.9414043Z D=7168, 2025-05-07T20:31:49.9414137Z scale_ub=1200.0, 2025-05-07T20:31:49.9414240Z contiguous=False, 2025-05-07T20:31:49.9414339Z compiled=False, 2025-05-07T20:31:49.9414422Z ) 2025-05-07T20:31:49.9414644Z self = 2025-05-07T20:31:49.9414818Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9414822Z 2025-05-07T20:31:49.9414898Z @given( 2025-05-07T20:31:49.9415017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9415115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9415230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9415349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9415474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9415550Z ) 2025-05-07T20:31:49.9415794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9415886Z def test_silu_mul_quant( 2025-05-07T20:31:49.9415964Z self, 2025-05-07T20:31:49.9416041Z T: int, 2025-05-07T20:31:49.9416116Z D: int, 2025-05-07T20:31:49.9416215Z scale_ub: Optional[float], 2025-05-07T20:31:49.9416303Z contiguous: bool, 2025-05-07T20:31:49.9416387Z compiled: bool, 2025-05-07T20:31:49.9416466Z ) -> None: 2025-05-07T20:31:49.9416564Z torch.manual_seed(2025) 2025-05-07T20:31:49.9416640Z 2025-05-07T20:31:49.9416806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9416879Z 2025-05-07T20:31:49.9416972Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9417095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9417264Z x = x_sign * x_clamp 2025-05-07T20:31:49.9417351Z x0 = x[:, :D] 2025-05-07T20:31:49.9417433Z x1 = x[:, D:] 2025-05-07T20:31:49.9417504Z 2025-05-07T20:31:49.9417591Z if contiguous: 2025-05-07T20:31:49.9417682Z x0 = x0.contiguous() 2025-05-07T20:31:49.9417770Z x1 = x1.contiguous() 2025-05-07T20:31:49.9417847Z 2025-05-07T20:31:49.9417938Z if scale_ub is not None: 2025-05-07T20:31:49.9418048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9418185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9418260Z ) 2025-05-07T20:31:49.9418344Z else: 2025-05-07T20:31:49.9418439Z scale_ub_tensor = None 2025-05-07T20:31:49.9418510Z 2025-05-07T20:31:49.9418643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9418732Z op = silu_mul_quant 2025-05-07T20:31:49.9418817Z if compiled: 2025-05-07T20:31:49.9418926Z op = torch.compile(op) 2025-05-07T20:31:49.9419036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9419107Z 2025-05-07T20:31:49.9419201Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9419205Z 2025-05-07T20:31:49.9419300Z moe/activation_test.py:117: 2025-05-07T20:31:49.9419432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9419531Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9419629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9420122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9420217Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9420578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9420804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9421295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9421393Z kernel = self.compile( 2025-05-07T20:31:49.9421777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9421953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9422087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9422091Z 2025-05-07T20:31:49.9422297Z self = 2025-05-07T20:31:49.9423065Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9423568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd71b0670>} 2025-05-07T20:31:49.9424361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9424554Z context = 2025-05-07T20:31:49.9424559Z 2025-05-07T20:31:49.9424726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9424997Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9425103Z module_map=module_map) 2025-05-07T20:31:49.9425263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9425363Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9425438Z E ^ 2025-05-07T20:31:49.9425873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9425887Z 2025-05-07T20:31:49.9426305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9426309Z 2025-05-07T20:31:49.9426410Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9426635Z self=, 2025-05-07T20:31:49.9426710Z T=1, 2025-05-07T20:31:49.9426789Z D=7168, 2025-05-07T20:31:49.9426873Z scale_ub=None, 2025-05-07T20:31:49.9426957Z contiguous=True, 2025-05-07T20:31:49.9427039Z compiled=False, 2025-05-07T20:31:49.9427115Z ) 2025-05-07T20:31:49.9427330Z self = 2025-05-07T20:31:49.9427497Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9427502Z 2025-05-07T20:31:49.9427578Z @given( 2025-05-07T20:31:49.9427705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9427810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9427924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9428040Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9428158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9428231Z ) 2025-05-07T20:31:49.9428480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9428573Z def test_silu_mul_quant( 2025-05-07T20:31:49.9428648Z self, 2025-05-07T20:31:49.9428725Z T: int, 2025-05-07T20:31:49.9428800Z D: int, 2025-05-07T20:31:49.9428897Z scale_ub: Optional[float], 2025-05-07T20:31:49.9428989Z contiguous: bool, 2025-05-07T20:31:49.9429074Z compiled: bool, 2025-05-07T20:31:49.9429153Z ) -> None: 2025-05-07T20:31:49.9429252Z torch.manual_seed(2025) 2025-05-07T20:31:49.9429409Z 2025-05-07T20:31:49.9429577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9429654Z 2025-05-07T20:31:49.9429745Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9429873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9429961Z x = x_sign * x_clamp 2025-05-07T20:31:49.9430041Z x0 = x[:, :D] 2025-05-07T20:31:49.9430122Z x1 = x[:, D:] 2025-05-07T20:31:49.9430193Z 2025-05-07T20:31:49.9430275Z if contiguous: 2025-05-07T20:31:49.9430370Z x0 = x0.contiguous() 2025-05-07T20:31:49.9430459Z x1 = x1.contiguous() 2025-05-07T20:31:49.9430532Z 2025-05-07T20:31:49.9430626Z if scale_ub is not None: 2025-05-07T20:31:49.9430729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9430865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9430942Z ) 2025-05-07T20:31:49.9431018Z else: 2025-05-07T20:31:49.9431120Z scale_ub_tensor = None 2025-05-07T20:31:49.9431194Z 2025-05-07T20:31:49.9431326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9431417Z op = silu_mul_quant 2025-05-07T20:31:49.9431501Z if compiled: 2025-05-07T20:31:49.9431600Z op = torch.compile(op) 2025-05-07T20:31:49.9431709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9431780Z 2025-05-07T20:31:49.9431870Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9431874Z 2025-05-07T20:31:49.9431973Z moe/activation_test.py:117: 2025-05-07T20:31:49.9432101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9432201Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9432302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9432796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9432974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9433339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9433565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9433912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9434005Z kernel = self.compile( 2025-05-07T20:31:49.9434388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9434569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9434695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9434699Z 2025-05-07T20:31:49.9434907Z self = 2025-05-07T20:31:49.9435681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9436182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee3160>} 2025-05-07T20:31:49.9436930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9437121Z context = 2025-05-07T20:31:49.9437126Z 2025-05-07T20:31:49.9437293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9437565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9437770Z module_map=module_map) 2025-05-07T20:31:49.9437931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9438028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9438109Z E ^ 2025-05-07T20:31:49.9438458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9438463Z 2025-05-07T20:31:49.9438879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9438886Z 2025-05-07T20:31:49.9438987Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9439208Z self=, 2025-05-07T20:31:49.9439286Z T=16384, 2025-05-07T20:31:49.9439361Z D=7168, 2025-05-07T20:31:49.9439443Z scale_ub=1200.0, 2025-05-07T20:31:49.9439531Z contiguous=False, 2025-05-07T20:31:49.9439622Z compiled=True, 2025-05-07T20:31:49.9439695Z ) 2025-05-07T20:31:49.9439915Z self = 2025-05-07T20:31:49.9440451Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9440460Z 2025-05-07T20:31:49.9440578Z @given( 2025-05-07T20:31:49.9440738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9440867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9441031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9441164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9441278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9441355Z ) 2025-05-07T20:31:49.9441600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9441693Z def test_silu_mul_quant( 2025-05-07T20:31:49.9441773Z self, 2025-05-07T20:31:49.9441850Z T: int, 2025-05-07T20:31:49.9442077Z D: int, 2025-05-07T20:31:49.9442183Z scale_ub: Optional[float], 2025-05-07T20:31:49.9442271Z contiguous: bool, 2025-05-07T20:31:49.9442360Z compiled: bool, 2025-05-07T20:31:49.9442438Z ) -> None: 2025-05-07T20:31:49.9442531Z torch.manual_seed(2025) 2025-05-07T20:31:49.9442606Z 2025-05-07T20:31:49.9442773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9442847Z 2025-05-07T20:31:49.9442942Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9443068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9443155Z x = x_sign * x_clamp 2025-05-07T20:31:49.9443239Z x0 = x[:, :D] 2025-05-07T20:31:49.9443318Z x1 = x[:, D:] 2025-05-07T20:31:49.9443390Z 2025-05-07T20:31:49.9443476Z if contiguous: 2025-05-07T20:31:49.9443566Z x0 = x0.contiguous() 2025-05-07T20:31:49.9443653Z x1 = x1.contiguous() 2025-05-07T20:31:49.9443742Z 2025-05-07T20:31:49.9443831Z if scale_ub is not None: 2025-05-07T20:31:49.9443938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9444071Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9444148Z ) 2025-05-07T20:31:49.9444229Z else: 2025-05-07T20:31:49.9444322Z scale_ub_tensor = None 2025-05-07T20:31:49.9444397Z 2025-05-07T20:31:49.9444529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9444618Z op = silu_mul_quant 2025-05-07T20:31:49.9444702Z if compiled: 2025-05-07T20:31:49.9444805Z op = torch.compile(op) 2025-05-07T20:31:49.9444909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9444980Z 2025-05-07T20:31:49.9445072Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9445077Z 2025-05-07T20:31:49.9445172Z moe/activation_test.py:117: 2025-05-07T20:31:49.9445306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9445533Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9445634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9446000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9446092Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9446590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9446692Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9447048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9447272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9447606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9447708Z kernel = self.compile( 2025-05-07T20:31:49.9448095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9448272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9448403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9448407Z 2025-05-07T20:31:49.9448616Z self = 2025-05-07T20:31:49.9449379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9449889Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee34c0>} 2025-05-07T20:31:49.9450719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9450915Z context = 2025-05-07T20:31:49.9450919Z 2025-05-07T20:31:49.9451083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9451350Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9451460Z module_map=module_map) 2025-05-07T20:31:49.9451621Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9451722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9451798Z E ^ 2025-05-07T20:31:49.9452149Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9452153Z 2025-05-07T20:31:49.9452583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9452588Z 2025-05-07T20:31:49.9452688Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9452914Z self=, 2025-05-07T20:31:49.9452991Z T=1, 2025-05-07T20:31:49.9453065Z D=7168, 2025-05-07T20:31:49.9453150Z scale_ub=None, 2025-05-07T20:31:49.9453234Z contiguous=False, 2025-05-07T20:31:49.9453317Z compiled=False, 2025-05-07T20:31:49.9453391Z ) 2025-05-07T20:31:49.9453608Z self = 2025-05-07T20:31:49.9453773Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9453777Z 2025-05-07T20:31:49.9453857Z @given( 2025-05-07T20:31:49.9453973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9454071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9454274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9454391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9454507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9454579Z ) 2025-05-07T20:31:49.9454824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9454921Z def test_silu_mul_quant( 2025-05-07T20:31:49.9454999Z self, 2025-05-07T20:31:49.9455075Z T: int, 2025-05-07T20:31:49.9455154Z D: int, 2025-05-07T20:31:49.9455252Z scale_ub: Optional[float], 2025-05-07T20:31:49.9455340Z contiguous: bool, 2025-05-07T20:31:49.9455428Z compiled: bool, 2025-05-07T20:31:49.9455505Z ) -> None: 2025-05-07T20:31:49.9455599Z torch.manual_seed(2025) 2025-05-07T20:31:49.9455675Z 2025-05-07T20:31:49.9455845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9455927Z 2025-05-07T20:31:49.9456022Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9456147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9456238Z x = x_sign * x_clamp 2025-05-07T20:31:49.9456317Z x0 = x[:, :D] 2025-05-07T20:31:49.9456395Z x1 = x[:, D:] 2025-05-07T20:31:49.9456469Z 2025-05-07T20:31:49.9456551Z if contiguous: 2025-05-07T20:31:49.9456641Z x0 = x0.contiguous() 2025-05-07T20:31:49.9456733Z x1 = x1.contiguous() 2025-05-07T20:31:49.9456804Z 2025-05-07T20:31:49.9456897Z if scale_ub is not None: 2025-05-07T20:31:49.9457005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9457139Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9457216Z ) 2025-05-07T20:31:49.9457290Z else: 2025-05-07T20:31:49.9457383Z scale_ub_tensor = None 2025-05-07T20:31:49.9457459Z 2025-05-07T20:31:49.9457666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9457763Z op = silu_mul_quant 2025-05-07T20:31:49.9457854Z if compiled: 2025-05-07T20:31:49.9457954Z op = torch.compile(op) 2025-05-07T20:31:49.9458057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9458131Z 2025-05-07T20:31:49.9458221Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9458225Z 2025-05-07T20:31:49.9458322Z moe/activation_test.py:117: 2025-05-07T20:31:49.9458456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9458555Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9458659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9459157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9459252Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9459618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9459848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9460189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9460281Z kernel = self.compile( 2025-05-07T20:31:49.9460659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9460833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9460958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9460963Z 2025-05-07T20:31:49.9461227Z self = 2025-05-07T20:31:49.9462015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9462605Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7023820>} 2025-05-07T20:31:49.9463357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9463549Z context = 2025-05-07T20:31:49.9463553Z 2025-05-07T20:31:49.9463726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9463993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9464098Z module_map=module_map) 2025-05-07T20:31:49.9464267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9464372Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9464449Z E ^ 2025-05-07T20:31:49.9464804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9464808Z 2025-05-07T20:31:49.9465217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9465222Z 2025-05-07T20:31:49.9465326Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9465549Z self=, 2025-05-07T20:31:49.9465624Z T=2048, 2025-05-07T20:31:49.9465702Z D=7168, 2025-05-07T20:31:49.9465782Z scale_ub=None, 2025-05-07T20:31:49.9465866Z contiguous=False, 2025-05-07T20:31:49.9465955Z compiled=True, 2025-05-07T20:31:49.9466029Z ) 2025-05-07T20:31:49.9466324Z self = 2025-05-07T20:31:49.9466510Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9466515Z 2025-05-07T20:31:49.9466590Z @given( 2025-05-07T20:31:49.9466708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9466816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9466930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9467048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9467164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9467240Z ) 2025-05-07T20:31:49.9467495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9472239Z def test_silu_mul_quant( 2025-05-07T20:31:49.9472334Z self, 2025-05-07T20:31:49.9472412Z T: int, 2025-05-07T20:31:49.9472491Z D: int, 2025-05-07T20:31:49.9472592Z scale_ub: Optional[float], 2025-05-07T20:31:49.9472694Z contiguous: bool, 2025-05-07T20:31:49.9472782Z compiled: bool, 2025-05-07T20:31:49.9472861Z ) -> None: 2025-05-07T20:31:49.9472956Z torch.manual_seed(2025) 2025-05-07T20:31:49.9473033Z 2025-05-07T20:31:49.9473208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9473284Z 2025-05-07T20:31:49.9473384Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9473511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9473599Z x = x_sign * x_clamp 2025-05-07T20:31:49.9473683Z x0 = x[:, :D] 2025-05-07T20:31:49.9473763Z x1 = x[:, D:] 2025-05-07T20:31:49.9473841Z 2025-05-07T20:31:49.9473924Z if contiguous: 2025-05-07T20:31:49.9474015Z x0 = x0.contiguous() 2025-05-07T20:31:49.9474109Z x1 = x1.contiguous() 2025-05-07T20:31:49.9474185Z 2025-05-07T20:31:49.9474277Z if scale_ub is not None: 2025-05-07T20:31:49.9474388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9474658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9474734Z ) 2025-05-07T20:31:49.9474817Z else: 2025-05-07T20:31:49.9474910Z scale_ub_tensor = None 2025-05-07T20:31:49.9474983Z 2025-05-07T20:31:49.9475121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9475211Z op = silu_mul_quant 2025-05-07T20:31:49.9475299Z if compiled: 2025-05-07T20:31:49.9475399Z op = torch.compile(op) 2025-05-07T20:31:49.9475504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9475579Z 2025-05-07T20:31:49.9475671Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9475676Z 2025-05-07T20:31:49.9475773Z moe/activation_test.py:117: 2025-05-07T20:31:49.9475903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9476004Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9476108Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9476494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9476586Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9477090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9477185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9477540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9477766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9478106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9478206Z kernel = self.compile( 2025-05-07T20:31:49.9478659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9478842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9478970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9478974Z 2025-05-07T20:31:49.9479185Z self = 2025-05-07T20:31:49.9479973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9480477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6fe1790>} 2025-05-07T20:31:49.9481234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9481436Z context = 2025-05-07T20:31:49.9481441Z 2025-05-07T20:31:49.9481605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9481878Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9481986Z module_map=module_map) 2025-05-07T20:31:49.9482149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9482254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9482332Z E ^ 2025-05-07T20:31:49.9482689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9482693Z 2025-05-07T20:31:49.9483106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9483189Z 2025-05-07T20:31:49.9483294Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9483519Z self=, 2025-05-07T20:31:49.9483596Z T=4096, 2025-05-07T20:31:49.9483674Z D=7168, 2025-05-07T20:31:49.9483758Z scale_ub=None, 2025-05-07T20:31:49.9483843Z contiguous=False, 2025-05-07T20:31:49.9483925Z compiled=True, 2025-05-07T20:31:49.9483999Z ) 2025-05-07T20:31:49.9484248Z self = 2025-05-07T20:31:49.9484437Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9484444Z 2025-05-07T20:31:49.9484521Z @given( 2025-05-07T20:31:49.9484639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9484741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9484854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9484974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9485094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9485167Z ) 2025-05-07T20:31:49.9485412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9485507Z def test_silu_mul_quant( 2025-05-07T20:31:49.9485582Z self, 2025-05-07T20:31:49.9485659Z T: int, 2025-05-07T20:31:49.9485740Z D: int, 2025-05-07T20:31:49.9485837Z scale_ub: Optional[float], 2025-05-07T20:31:49.9485926Z contiguous: bool, 2025-05-07T20:31:49.9486010Z compiled: bool, 2025-05-07T20:31:49.9486087Z ) -> None: 2025-05-07T20:31:49.9486182Z torch.manual_seed(2025) 2025-05-07T20:31:49.9486256Z 2025-05-07T20:31:49.9486424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9486500Z 2025-05-07T20:31:49.9486591Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9486715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9486890Z x = x_sign * x_clamp 2025-05-07T20:31:49.9486977Z x0 = x[:, :D] 2025-05-07T20:31:49.9487058Z x1 = x[:, D:] 2025-05-07T20:31:49.9487133Z 2025-05-07T20:31:49.9487216Z if contiguous: 2025-05-07T20:31:49.9487310Z x0 = x0.contiguous() 2025-05-07T20:31:49.9487399Z x1 = x1.contiguous() 2025-05-07T20:31:49.9487472Z 2025-05-07T20:31:49.9487563Z if scale_ub is not None: 2025-05-07T20:31:49.9487669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9487803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9487881Z ) 2025-05-07T20:31:49.9487957Z else: 2025-05-07T20:31:49.9488051Z scale_ub_tensor = None 2025-05-07T20:31:49.9488130Z 2025-05-07T20:31:49.9488259Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9488348Z op = silu_mul_quant 2025-05-07T20:31:49.9488436Z if compiled: 2025-05-07T20:31:49.9488545Z op = torch.compile(op) 2025-05-07T20:31:49.9488653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9488726Z 2025-05-07T20:31:49.9488816Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9488821Z 2025-05-07T20:31:49.9488921Z moe/activation_test.py:117: 2025-05-07T20:31:49.9489048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9489148Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9489253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9489617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9489709Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9490212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9490307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9490756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9490981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9491316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9491412Z kernel = self.compile( 2025-05-07T20:31:49.9491795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9491977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9492101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9492105Z 2025-05-07T20:31:49.9492312Z self = 2025-05-07T20:31:49.9493087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9493589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a4c0>} 2025-05-07T20:31:49.9494378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9494571Z context = 2025-05-07T20:31:49.9494575Z 2025-05-07T20:31:49.9494743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9495013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9495119Z module_map=module_map) 2025-05-07T20:31:49.9495363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9495462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9495539Z E ^ 2025-05-07T20:31:49.9495899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9495904Z 2025-05-07T20:31:49.9496319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9496323Z 2025-05-07T20:31:49.9496429Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9496650Z self=, 2025-05-07T20:31:49.9496727Z T=16384, 2025-05-07T20:31:49.9496805Z D=5120, 2025-05-07T20:31:49.9496888Z scale_ub=1200.0, 2025-05-07T20:31:49.9496974Z contiguous=False, 2025-05-07T20:31:49.9497067Z compiled=False, 2025-05-07T20:31:49.9497139Z ) 2025-05-07T20:31:49.9497365Z self = 2025-05-07T20:31:49.9497547Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9497552Z 2025-05-07T20:31:49.9497628Z @given( 2025-05-07T20:31:49.9497752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9497852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9497965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9498083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9498195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9498268Z ) 2025-05-07T20:31:49.9498517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9498608Z def test_silu_mul_quant( 2025-05-07T20:31:49.9498685Z self, 2025-05-07T20:31:49.9498765Z T: int, 2025-05-07T20:31:49.9498841Z D: int, 2025-05-07T20:31:49.9498941Z scale_ub: Optional[float], 2025-05-07T20:31:49.9499112Z contiguous: bool, 2025-05-07T20:31:49.9499197Z compiled: bool, 2025-05-07T20:31:49.9499278Z ) -> None: 2025-05-07T20:31:49.9499372Z torch.manual_seed(2025) 2025-05-07T20:31:49.9499443Z 2025-05-07T20:31:49.9499616Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9499689Z 2025-05-07T20:31:49.9499779Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9499908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9499996Z x = x_sign * x_clamp 2025-05-07T20:31:49.9500075Z x0 = x[:, :D] 2025-05-07T20:31:49.9500159Z x1 = x[:, D:] 2025-05-07T20:31:49.9500233Z 2025-05-07T20:31:49.9500315Z if contiguous: 2025-05-07T20:31:49.9500411Z x0 = x0.contiguous() 2025-05-07T20:31:49.9500499Z x1 = x1.contiguous() 2025-05-07T20:31:49.9500573Z 2025-05-07T20:31:49.9500664Z if scale_ub is not None: 2025-05-07T20:31:49.9500781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9500919Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9500994Z ) 2025-05-07T20:31:49.9501077Z else: 2025-05-07T20:31:49.9501256Z scale_ub_tensor = None 2025-05-07T20:31:49.9501332Z 2025-05-07T20:31:49.9501462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9501556Z op = silu_mul_quant 2025-05-07T20:31:49.9501641Z if compiled: 2025-05-07T20:31:49.9501741Z op = torch.compile(op) 2025-05-07T20:31:49.9501849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9501924Z 2025-05-07T20:31:49.9502016Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9502021Z 2025-05-07T20:31:49.9502117Z moe/activation_test.py:117: 2025-05-07T20:31:49.9502244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9502348Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9502558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9503063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:49.9503163Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9503526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9503754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9504094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9504186Z kernel = self.compile( 2025-05-07T20:31:49.9504574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9504754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9504890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9504897Z 2025-05-07T20:31:49.9505104Z self = 2025-05-07T20:31:49.9505883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9506384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a820>} 2025-05-07T20:31:49.9507119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9507312Z context = 2025-05-07T20:31:49.9507398Z 2025-05-07T20:31:49.9507568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9507830Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9507938Z module_map=module_map) 2025-05-07T20:31:49.9508104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9508201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9508284Z E ^ 2025-05-07T20:31:49.9508634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9508639Z 2025-05-07T20:31:49.9509049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9509054Z 2025-05-07T20:31:49.9509158Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9509384Z self=, 2025-05-07T20:31:49.9509469Z T=16384, 2025-05-07T20:31:49.9509547Z D=5120, 2025-05-07T20:31:49.9509629Z scale_ub=1200.0, 2025-05-07T20:31:49.9509714Z contiguous=True, 2025-05-07T20:31:49.9509796Z compiled=True, 2025-05-07T20:31:49.9509868Z ) 2025-05-07T20:31:49.9510086Z self = 2025-05-07T20:31:49.9510259Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9510264Z 2025-05-07T20:31:49.9510339Z @given( 2025-05-07T20:31:49.9510464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9510563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9510676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9510796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9510910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9510988Z ) 2025-05-07T20:31:49.9511316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9511411Z def test_silu_mul_quant( 2025-05-07T20:31:49.9511490Z self, 2025-05-07T20:31:49.9511566Z T: int, 2025-05-07T20:31:49.9511642Z D: int, 2025-05-07T20:31:49.9511743Z scale_ub: Optional[float], 2025-05-07T20:31:49.9511832Z contiguous: bool, 2025-05-07T20:31:49.9511917Z compiled: bool, 2025-05-07T20:31:49.9511998Z ) -> None: 2025-05-07T20:31:49.9512093Z torch.manual_seed(2025) 2025-05-07T20:31:49.9512165Z 2025-05-07T20:31:49.9512337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9512410Z 2025-05-07T20:31:49.9512504Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9512629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9512716Z x = x_sign * x_clamp 2025-05-07T20:31:49.9512798Z x0 = x[:, :D] 2025-05-07T20:31:49.9512885Z x1 = x[:, D:] 2025-05-07T20:31:49.9512957Z 2025-05-07T20:31:49.9513043Z if contiguous: 2025-05-07T20:31:49.9513133Z x0 = x0.contiguous() 2025-05-07T20:31:49.9513221Z x1 = x1.contiguous() 2025-05-07T20:31:49.9513296Z 2025-05-07T20:31:49.9513388Z if scale_ub is not None: 2025-05-07T20:31:49.9513495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9513639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9513716Z ) 2025-05-07T20:31:49.9513795Z else: 2025-05-07T20:31:49.9513887Z scale_ub_tensor = None 2025-05-07T20:31:49.9513958Z 2025-05-07T20:31:49.9514091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9514180Z op = silu_mul_quant 2025-05-07T20:31:49.9514264Z if compiled: 2025-05-07T20:31:49.9514368Z op = torch.compile(op) 2025-05-07T20:31:49.9514474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9514632Z 2025-05-07T20:31:49.9514726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9514731Z 2025-05-07T20:31:49.9514827Z moe/activation_test.py:117: 2025-05-07T20:31:49.9514954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9515060Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9515158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9515525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9515618Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9516114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9516213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9516569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9516807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9517149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9517244Z kernel = self.compile( 2025-05-07T20:31:49.9517630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9517803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9517928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9517932Z 2025-05-07T20:31:49.9518139Z self = 2025-05-07T20:31:49.9518902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9519490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6df6e50>} 2025-05-07T20:31:49.9520238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9520435Z context = 2025-05-07T20:31:49.9520440Z 2025-05-07T20:31:49.9520604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9520865Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9520974Z module_map=module_map) 2025-05-07T20:31:49.9521135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9521237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9521324Z E ^ 2025-05-07T20:31:49.9521675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9521679Z 2025-05-07T20:31:49.9522089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9522093Z 2025-05-07T20:31:49.9522194Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9522415Z self=, 2025-05-07T20:31:49.9522497Z T=16384, 2025-05-07T20:31:49.9522572Z D=5120, 2025-05-07T20:31:49.9522653Z scale_ub=None, 2025-05-07T20:31:49.9522743Z contiguous=False, 2025-05-07T20:31:49.9522825Z compiled=True, 2025-05-07T20:31:49.9522902Z ) 2025-05-07T20:31:49.9523117Z self = 2025-05-07T20:31:49.9523302Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9523381Z 2025-05-07T20:31:49.9523461Z @given( 2025-05-07T20:31:49.9523578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9523675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9523793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9523909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9524038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9524121Z ) 2025-05-07T20:31:49.9524391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9524487Z def test_silu_mul_quant( 2025-05-07T20:31:49.9524561Z self, 2025-05-07T20:31:49.9524637Z T: int, 2025-05-07T20:31:49.9524714Z D: int, 2025-05-07T20:31:49.9524810Z scale_ub: Optional[float], 2025-05-07T20:31:49.9524898Z contiguous: bool, 2025-05-07T20:31:49.9524987Z compiled: bool, 2025-05-07T20:31:49.9525074Z ) -> None: 2025-05-07T20:31:49.9525169Z torch.manual_seed(2025) 2025-05-07T20:31:49.9525243Z 2025-05-07T20:31:49.9525410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9525483Z 2025-05-07T20:31:49.9525576Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9525699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9525789Z x = x_sign * x_clamp 2025-05-07T20:31:49.9525868Z x0 = x[:, :D] 2025-05-07T20:31:49.9525946Z x1 = x[:, D:] 2025-05-07T20:31:49.9526020Z 2025-05-07T20:31:49.9526102Z if contiguous: 2025-05-07T20:31:49.9526194Z x0 = x0.contiguous() 2025-05-07T20:31:49.9526286Z x1 = x1.contiguous() 2025-05-07T20:31:49.9526357Z 2025-05-07T20:31:49.9526446Z if scale_ub is not None: 2025-05-07T20:31:49.9526552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9526686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9526934Z ) 2025-05-07T20:31:49.9527018Z else: 2025-05-07T20:31:49.9527115Z scale_ub_tensor = None 2025-05-07T20:31:49.9527191Z 2025-05-07T20:31:49.9527322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9527412Z op = silu_mul_quant 2025-05-07T20:31:49.9527498Z if compiled: 2025-05-07T20:31:49.9527597Z op = torch.compile(op) 2025-05-07T20:31:49.9527701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9527776Z 2025-05-07T20:31:49.9527868Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9527872Z 2025-05-07T20:31:49.9527968Z moe/activation_test.py:117: 2025-05-07T20:31:49.9528099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9528197Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9528297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9528675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9528771Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9529263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9529358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9529718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9529944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9530279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9530374Z kernel = self.compile( 2025-05-07T20:31:49.9530750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9530934Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9531140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9531144Z 2025-05-07T20:31:49.9531349Z self = 2025-05-07T20:31:49.9532116Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9532615Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6d799d0>} 2025-05-07T20:31:49.9533350Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9533557Z context = 2025-05-07T20:31:49.9533561Z 2025-05-07T20:31:49.9533731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9533998Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9534128Z module_map=module_map) 2025-05-07T20:31:49.9534316Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9534416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9534492Z E ^ 2025-05-07T20:31:49.9534846Z E ValueError("type fp8e4nv not supported in this architecture. 
The same test body and the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')", raised from triton/compiler/compiler.py:100 while compiling _fbgemm_silu_mul_quant) were then reported verbatim for each of the following examples; only the hypothesis parameters differ:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Note that the compiled=False examples fail the same way (their tracebacks go straight from activation_test.py into activation.py, without torch/_dynamo/eval_frame.py), so silu_mul_quant launches the Triton kernel unconditionally and torch.compile is not a factor.
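For reference, here is a minimal sketch that reproduces the failure outside hypothesis, using the import path and call signature visible in the tracebacks above; the shapes and the None scale upper bound are arbitrary choices for illustration.

    import torch

    # Import path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # arbitrary small example
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    # On a GPU without fp8e4nv support this raises the same CompilationError
    # as the log; on a supported GPU it returns the fp8 tensor and its scales.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)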
After the compilation failures, the remaining examples began failing with CUDA out-of-memory errors instead:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
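The allocation size lines up with the test's intermediates: x is a [T, 2*D] bfloat16 tensor, so torch.abs(x) needs T * 2*D * 2 bytes, and for T=16384, D=5120 that is 335,544,320 bytes, exactly the 320.00 MiB requested above. The OOM is therefore less about any single example than about the roughly 21.6 GiB of allocations that accumulated across the earlier examples in the same process. Below is a small hygiene sketch that a per-example hook could run; the helpers are hypothetical, not something the test currently does, and the error text itself also suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation.

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead Python references,
        # then return cached blocks to CUDA so the next hypothesis example
        # starts from an empty allocator pool.
        gc.collect()
        torch.cuda.empty_cache()

    def bf16_activation_bytes(T: int, D: int) -> int:
        # Size of one [T, 2*D] bfloat16 intermediate (2 bytes per element).
        return T * 2 * D * 2

    assert bf16_activation_bytes(16384, 5120) == 320 * 1024 * 1024  # the 320.00 MiB above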
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9676967Z 2025-05-07T20:31:49.9677089Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9677093Z 2025-05-07T20:31:49.9677193Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9677417Z self=, 2025-05-07T20:31:49.9677493Z T=4096, 2025-05-07T20:31:49.9677569Z D=7168, 2025-05-07T20:31:49.9677653Z scale_ub=1200.0, 2025-05-07T20:31:49.9677735Z contiguous=True, 2025-05-07T20:31:49.9677818Z compiled=True, 2025-05-07T20:31:49.9677895Z ) 2025-05-07T20:31:49.9678189Z self = 2025-05-07T20:31:49.9678368Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9678373Z 2025-05-07T20:31:49.9678451Z @given( 2025-05-07T20:31:49.9678566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9678668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9678780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9678895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9679008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9679082Z ) 2025-05-07T20:31:49.9679325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9679421Z def test_silu_mul_quant( 2025-05-07T20:31:49.9679496Z self, 2025-05-07T20:31:49.9679572Z T: int, 2025-05-07T20:31:49.9679652Z D: int, 2025-05-07T20:31:49.9679747Z scale_ub: Optional[float], 2025-05-07T20:31:49.9679848Z contiguous: bool, 2025-05-07T20:31:49.9679935Z compiled: bool, 2025-05-07T20:31:49.9680014Z ) -> None: 2025-05-07T20:31:49.9680110Z torch.manual_seed(2025) 2025-05-07T20:31:49.9680183Z 2025-05-07T20:31:49.9680351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9680427Z 2025-05-07T20:31:49.9680519Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9680642Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9682433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9682519Z 2025-05-07T20:31:49.9682638Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9682643Z 2025-05-07T20:31:49.9682747Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9682967Z self=, 2025-05-07T20:31:49.9683049Z T=16384, 2025-05-07T20:31:49.9683126Z D=7168, 2025-05-07T20:31:49.9683207Z scale_ub=None, 2025-05-07T20:31:49.9683294Z contiguous=False, 2025-05-07T20:31:49.9683378Z compiled=False, 2025-05-07T20:31:49.9683450Z ) 2025-05-07T20:31:49.9683666Z self = 2025-05-07T20:31:49.9683840Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9683844Z 2025-05-07T20:31:49.9683920Z @given( 2025-05-07T20:31:49.9684046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9684151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9684263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9684383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9684496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9684571Z ) 2025-05-07T20:31:49.9684820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9684912Z def test_silu_mul_quant( 2025-05-07T20:31:49.9684990Z self, 2025-05-07T20:31:49.9685067Z T: int, 2025-05-07T20:31:49.9685143Z D: int, 2025-05-07T20:31:49.9685244Z scale_ub: Optional[float], 2025-05-07T20:31:49.9685332Z contiguous: bool, 2025-05-07T20:31:49.9685417Z compiled: bool, 2025-05-07T20:31:49.9685496Z ) -> None: 2025-05-07T20:31:49.9685591Z torch.manual_seed(2025) 2025-05-07T20:31:49.9685663Z 2025-05-07T20:31:49.9685938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9687730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9687740Z 2025-05-07T20:31:49.9687854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9687858Z 2025-05-07T20:31:49.9687958Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9688181Z self=, 2025-05-07T20:31:49.9688261Z T=2048, 2025-05-07T20:31:49.9688341Z D=7168, 2025-05-07T20:31:49.9688426Z scale_ub=1200.0, 2025-05-07T20:31:49.9688508Z contiguous=True, 2025-05-07T20:31:49.9688589Z compiled=True, 2025-05-07T20:31:49.9688663Z ) 2025-05-07T20:31:49.9688877Z self = 2025-05-07T20:31:49.9689048Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9689058Z 2025-05-07T20:31:49.9689133Z @given( 2025-05-07T20:31:49.9689248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9689348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9689463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9689578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9689694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9689766Z ) 2025-05-07T20:31:49.9690020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9690191Z def test_silu_mul_quant( 2025-05-07T20:31:49.9690265Z self, 2025-05-07T20:31:49.9690343Z T: int, 2025-05-07T20:31:49.9690419Z D: int, 2025-05-07T20:31:49.9690516Z scale_ub: Optional[float], 2025-05-07T20:31:49.9690606Z contiguous: bool, 2025-05-07T20:31:49.9690691Z compiled: bool, 2025-05-07T20:31:49.9690770Z ) -> None: 2025-05-07T20:31:49.9690867Z torch.manual_seed(2025) 2025-05-07T20:31:49.9690941Z 2025-05-07T20:31:49.9691108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9691184Z 2025-05-07T20:31:49.9691274Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9691399Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9693148Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
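[Annotation] Every OutOfMemoryError in this run ends with the allocator's own suggestion, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A sketch of how that hint plus a cache flush between Hypothesis examples might be applied; `release_cached_blocks` is a hypothetical helper, not FBGEMM code, and the env var must be set before the first CUDA allocation in the process:

```python
import os

# Set before torch initializes CUDA, e.g. in the CI job environment.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_cached_blocks() -> None:
    # Returns cached-but-unallocated blocks to the driver between test
    # examples; this cannot free tensors that are still referenced.
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    print(
        f"allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB "
        f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB"
    )
```

Note the messages above show ~21.6 GiB still allocated by PyTorch, so live references leaking across examples, not fragmentation alone, are the likely root cause here.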
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9693160Z 2025-05-07T20:31:49.9693276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9693281Z 2025-05-07T20:31:49.9693387Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9693608Z self=, 2025-05-07T20:31:49.9693686Z T=2048, 2025-05-07T20:31:49.9693760Z D=7168, 2025-05-07T20:31:49.9693844Z scale_ub=None, 2025-05-07T20:31:49.9693934Z contiguous=True, 2025-05-07T20:31:49.9694017Z compiled=False, 2025-05-07T20:31:49.9694088Z ) 2025-05-07T20:31:49.9694382Z self = 2025-05-07T20:31:49.9694563Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9694568Z 2025-05-07T20:31:49.9694644Z @given( 2025-05-07T20:31:49.9694764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9694861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9694977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9695092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9695203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9695280Z ) 2025-05-07T20:31:49.9695527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9695619Z def test_silu_mul_quant( 2025-05-07T20:31:49.9695698Z self, 2025-05-07T20:31:49.9695773Z T: int, 2025-05-07T20:31:49.9695851Z D: int, 2025-05-07T20:31:49.9695950Z scale_ub: Optional[float], 2025-05-07T20:31:49.9696049Z contiguous: bool, 2025-05-07T20:31:49.9696134Z compiled: bool, 2025-05-07T20:31:49.9696214Z ) -> None: 2025-05-07T20:31:49.9696306Z torch.manual_seed(2025) 2025-05-07T20:31:49.9696381Z 2025-05-07T20:31:49.9696546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9696619Z 2025-05-07T20:31:49.9696713Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9698494Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9698581Z 2025-05-07T20:31:49.9698709Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9698714Z 2025-05-07T20:31:49.9698813Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9699033Z self=, 2025-05-07T20:31:49.9699111Z T=1, 2025-05-07T20:31:49.9699185Z D=7168, 2025-05-07T20:31:49.9699267Z scale_ub=1200.0, 2025-05-07T20:31:49.9699358Z contiguous=True, 2025-05-07T20:31:49.9699440Z compiled=False, 2025-05-07T20:31:49.9699513Z ) 2025-05-07T20:31:49.9699730Z self = 2025-05-07T20:31:49.9699895Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9699900Z 2025-05-07T20:31:49.9699977Z @given( 2025-05-07T20:31:49.9700093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9700190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9700315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9700434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9700546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9700621Z ) 2025-05-07T20:31:49.9700866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9700963Z def test_silu_mul_quant( 2025-05-07T20:31:49.9701037Z self, 2025-05-07T20:31:49.9701112Z T: int, 2025-05-07T20:31:49.9701257Z D: int, 2025-05-07T20:31:49.9701354Z scale_ub: Optional[float], 2025-05-07T20:31:49.9701442Z contiguous: bool, 2025-05-07T20:31:49.9701531Z compiled: bool, 2025-05-07T20:31:49.9701607Z ) -> None: 2025-05-07T20:31:49.9701699Z torch.manual_seed(2025) 2025-05-07T20:31:49.9701773Z 2025-05-07T20:31:49.9701942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9702014Z 2025-05-07T20:31:49.9702196Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9702323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9702415Z x = x_sign * x_clamp 2025-05-07T20:31:49.9702499Z x0 = x[:, :D] 2025-05-07T20:31:49.9702578Z x1 = x[:, D:] 2025-05-07T20:31:49.9702652Z 2025-05-07T20:31:49.9702734Z if contiguous: 2025-05-07T20:31:49.9702824Z x0 = x0.contiguous() 2025-05-07T20:31:49.9702917Z x1 = x1.contiguous() 2025-05-07T20:31:49.9702988Z 2025-05-07T20:31:49.9703077Z if scale_ub is not None: 2025-05-07T20:31:49.9703185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9703323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9703397Z ) 2025-05-07T20:31:49.9703477Z else: 2025-05-07T20:31:49.9703570Z scale_ub_tensor = None 2025-05-07T20:31:49.9703642Z 2025-05-07T20:31:49.9703778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9703877Z op = silu_mul_quant 2025-05-07T20:31:49.9703967Z if compiled: 2025-05-07T20:31:49.9704066Z op = torch.compile(op) 2025-05-07T20:31:49.9704170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9704243Z 2025-05-07T20:31:49.9704333Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9704338Z 2025-05-07T20:31:49.9704434Z moe/activation_test.py:117: 2025-05-07T20:31:49.9704563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9704661Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9704760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9705270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9705367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9705746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9706052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9706390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9706486Z kernel = self.compile( 2025-05-07T20:31:49.9706870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9707051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9707176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9707180Z 2025-05-07T20:31:49.9707382Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9708157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9708666Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd6798550>} 2025-05-07T20:31:49.9709418Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9709609Z context = <...> 2025-05-07T20:31:49.9709614Z 2025-05-07T20:31:49.9709779Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9710047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9710153Z module_map=module_map) 2025-05-07T20:31:49.9710317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9710496Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9710576Z E ^ 2025-05-07T20:31:49.9710932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9710936Z 2025-05-07T20:31:49.9711347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9711351Z 2025-05-07T20:31:49.9711457Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9711677Z self=, 2025-05-07T20:31:49.9711754Z T=128, 2025-05-07T20:31:49.9711832Z D=5120, 2025-05-07T20:31:49.9711913Z scale_ub=None, 2025-05-07T20:31:49.9711997Z contiguous=True, 2025-05-07T20:31:49.9712081Z compiled=False, 2025-05-07T20:31:49.9712152Z ) 2025-05-07T20:31:49.9712366Z self = 2025-05-07T20:31:49.9712555Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9712559Z 2025-05-07T20:31:49.9712635Z @given( 2025-05-07T20:31:49.9712757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9712855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9712968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9713088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9713202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9713274Z ) 2025-05-07T20:31:49.9713529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9713621Z def test_silu_mul_quant( 2025-05-07T20:31:49.9713699Z self, 2025-05-07T20:31:49.9713778Z T: int, 2025-05-07T20:31:49.9713852Z D: int, 2025-05-07T20:31:49.9713949Z scale_ub: Optional[float], 2025-05-07T20:31:49.9714039Z contiguous: bool, 2025-05-07T20:31:49.9714231Z compiled: bool, 2025-05-07T20:31:49.9714311Z ) -> None: 2025-05-07T20:31:49.9714405Z torch.manual_seed(2025) 2025-05-07T20:31:49.9714481Z 2025-05-07T20:31:49.9714653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9714727Z 2025-05-07T20:31:49.9714817Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9714942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9715039Z x = x_sign * x_clamp 2025-05-07T20:31:49.9715120Z x0 = x[:, :D] 2025-05-07T20:31:49.9715203Z x1 = x[:, D:] 2025-05-07T20:31:49.9715276Z 2025-05-07T20:31:49.9715357Z if contiguous: 2025-05-07T20:31:49.9715454Z x0 = x0.contiguous() 2025-05-07T20:31:49.9715542Z x1 = x1.contiguous() 2025-05-07T20:31:49.9720366Z 2025-05-07T20:31:49.9720466Z if scale_ub is not None: 2025-05-07T20:31:49.9720575Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9720727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9720803Z ) 2025-05-07T20:31:49.9720881Z else: 2025-05-07T20:31:49.9720973Z scale_ub_tensor = None 2025-05-07T20:31:49.9721045Z 2025-05-07T20:31:49.9721180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9721271Z op = silu_mul_quant 2025-05-07T20:31:49.9721359Z if compiled: 2025-05-07T20:31:49.9721466Z op = torch.compile(op) 2025-05-07T20:31:49.9721572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9721643Z 2025-05-07T20:31:49.9721736Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9721741Z 2025-05-07T20:31:49.9721838Z moe/activation_test.py:117: 2025-05-07T20:31:49.9721971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9722071Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9722171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9722788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9722889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9723247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9723477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9723825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9723924Z kernel = self.compile( 2025-05-07T20:31:49.9724303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9724483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9724611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9724628Z 2025-05-07T20:31:49.9724839Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9725621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9726122Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd696b040>} 2025-05-07T20:31:49.9726866Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9727059Z context = <...> 2025-05-07T20:31:49.9727063Z 2025-05-07T20:31:49.9727234Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9727584Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9727690Z module_map=module_map) 2025-05-07T20:31:49.9727854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9727955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9728031Z E ^ 2025-05-07T20:31:49.9728386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9728390Z 2025-05-07T20:31:49.9728798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9728803Z 2025-05-07T20:31:49.9728905Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9729130Z self=, 2025-05-07T20:31:49.9729207Z T=128, 2025-05-07T20:31:49.9729295Z D=7168, 2025-05-07T20:31:49.9729382Z scale_ub=None, 2025-05-07T20:31:49.9729467Z contiguous=True, 2025-05-07T20:31:49.9729555Z compiled=False, 2025-05-07T20:31:49.9729627Z ) 2025-05-07T20:31:49.9729842Z self = 2025-05-07T20:31:49.9730016Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9730020Z 2025-05-07T20:31:49.9730098Z @given( 2025-05-07T20:31:49.9730218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9730321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9730434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9730549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9730670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9730743Z ) 2025-05-07T20:31:49.9730996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9731178Z def test_silu_mul_quant( 2025-05-07T20:31:49.9731260Z self, 2025-05-07T20:31:49.9731342Z T: int, 2025-05-07T20:31:49.9731420Z D: int, 2025-05-07T20:31:49.9731519Z scale_ub: Optional[float], 2025-05-07T20:31:49.9731614Z contiguous: bool, 2025-05-07T20:31:49.9731699Z compiled: bool, 2025-05-07T20:31:49.9731778Z ) -> None: 2025-05-07T20:31:49.9731875Z torch.manual_seed(2025) 2025-05-07T20:31:49.9731947Z 2025-05-07T20:31:49.9732117Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9732192Z 2025-05-07T20:31:49.9732281Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9732407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9732497Z x = x_sign * x_clamp 2025-05-07T20:31:49.9732576Z x0 = x[:, :D] 2025-05-07T20:31:49.9732657Z x1 = x[:, D:] 2025-05-07T20:31:49.9732728Z 2025-05-07T20:31:49.9732821Z if contiguous: 2025-05-07T20:31:49.9732914Z x0 = x0.contiguous() 2025-05-07T20:31:49.9733001Z x1 = x1.contiguous() 2025-05-07T20:31:49.9733074Z 2025-05-07T20:31:49.9733168Z if scale_ub is not None: 2025-05-07T20:31:49.9733271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9733406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9733488Z ) 2025-05-07T20:31:49.9733565Z else: 2025-05-07T20:31:49.9733661Z scale_ub_tensor = None 2025-05-07T20:31:49.9733732Z 2025-05-07T20:31:49.9733861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9733951Z op = silu_mul_quant 2025-05-07T20:31:49.9734038Z if compiled: 2025-05-07T20:31:49.9734143Z op = torch.compile(op) 2025-05-07T20:31:49.9734271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9734354Z 2025-05-07T20:31:49.9734458Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9734547Z 2025-05-07T20:31:49.9734650Z moe/activation_test.py:117: 2025-05-07T20:31:49.9734779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9734882Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9734984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9735488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9735587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9735944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9736168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9736508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9736600Z kernel = self.compile( 2025-05-07T20:31:49.9736996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9737175Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9737300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9737304Z 2025-05-07T20:31:49.9737510Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9738292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9738799Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd696bc10>} 2025-05-07T20:31:49.9739612Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9739810Z context = <...> 2025-05-07T20:31:49.9739814Z 2025-05-07T20:31:49.9739986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9740504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9740619Z module_map=module_map) 2025-05-07T20:31:49.9740779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9740880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9740960Z E ^ 2025-05-07T20:31:49.9741364Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9741369Z 2025-05-07T20:31:49.9741801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9741814Z 2025-05-07T20:31:49.9741919Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9742139Z self=, 2025-05-07T20:31:49.9742221Z T=2048, 2025-05-07T20:31:49.9742296Z D=7168, 2025-05-07T20:31:49.9742378Z scale_ub=1200.0, 2025-05-07T20:31:49.9742465Z contiguous=True, 2025-05-07T20:31:49.9742547Z compiled=False, 2025-05-07T20:31:49.9742620Z ) 2025-05-07T20:31:49.9742840Z self = 2025-05-07T20:31:49.9743012Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9743016Z 2025-05-07T20:31:49.9743094Z @given( 2025-05-07T20:31:49.9743211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9743308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9743571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9743687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9743801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9743878Z ) 2025-05-07T20:31:49.9744143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9744243Z def test_silu_mul_quant( 2025-05-07T20:31:49.9744342Z self, 2025-05-07T20:31:49.9744420Z T: int, 2025-05-07T20:31:49.9744501Z D: int, 2025-05-07T20:31:49.9744600Z scale_ub: Optional[float], 2025-05-07T20:31:49.9744689Z contiguous: bool, 2025-05-07T20:31:49.9744779Z compiled: bool, 2025-05-07T20:31:49.9744856Z ) -> None: 2025-05-07T20:31:49.9744949Z torch.manual_seed(2025) 2025-05-07T20:31:49.9745023Z 2025-05-07T20:31:49.9745191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9746952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
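[Annotation] For context on what the failing op computes: `silu_mul_quant` fuses a SiLU-gated multiply with quantization to FP8. A plain-PyTorch reference sketch is below; only the `(x0, x1, scale_ub)` signature and the `(y_fp8, y_scale)` return come from the log, and the rowwise e4m3 scaling scheme is an assumption rather than FBGEMM's actual algorithm:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1 in float32, then rowwise FP8 quantization with an
    # optional upper bound on the pre-scale row maximum (assumed scheme).
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = torch.clamp(y / scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)
```

A reference like this also runs on GPUs where the Triton fp8e4nv kernel cannot compile, which would separate the numerics check from the codegen requirement seen above.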
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9746969Z 2025-05-07T20:31:49.9747089Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9747093Z 2025-05-07T20:31:49.9747194Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9747423Z self=, 2025-05-07T20:31:49.9747499Z T=1, 2025-05-07T20:31:49.9747578Z D=5120, 2025-05-07T20:31:49.9747662Z scale_ub=1200.0, 2025-05-07T20:31:49.9747747Z contiguous=True, 2025-05-07T20:31:49.9747951Z compiled=False, 2025-05-07T20:31:49.9748026Z ) 2025-05-07T20:31:49.9748241Z self = 2025-05-07T20:31:49.9748411Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9748415Z 2025-05-07T20:31:49.9748491Z @given( 2025-05-07T20:31:49.9748607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9748710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9748824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9748940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9749059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9749133Z ) 2025-05-07T20:31:49.9749384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9749476Z def test_silu_mul_quant( 2025-05-07T20:31:49.9749553Z self, 2025-05-07T20:31:49.9749642Z T: int, 2025-05-07T20:31:49.9749720Z D: int, 2025-05-07T20:31:49.9749818Z scale_ub: Optional[float], 2025-05-07T20:31:49.9749908Z contiguous: bool, 2025-05-07T20:31:49.9749992Z compiled: bool, 2025-05-07T20:31:49.9750069Z ) -> None: 2025-05-07T20:31:49.9750164Z torch.manual_seed(2025) 2025-05-07T20:31:49.9750236Z 2025-05-07T20:31:49.9750404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9750481Z 2025-05-07T20:31:49.9750572Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9750700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9750789Z x = x_sign * x_clamp 2025-05-07T20:31:49.9750868Z x0 = x[:, :D] 2025-05-07T20:31:49.9750958Z x1 = x[:, D:] 2025-05-07T20:31:49.9751029Z 2025-05-07T20:31:49.9751111Z if contiguous: 2025-05-07T20:31:49.9751207Z x0 = x0.contiguous() 2025-05-07T20:31:49.9751297Z x1 = x1.contiguous() 2025-05-07T20:31:49.9751481Z 2025-05-07T20:31:49.9751576Z if scale_ub is not None: 2025-05-07T20:31:49.9751683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9751820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9751898Z ) 2025-05-07T20:31:49.9751976Z else: 2025-05-07T20:31:49.9752074Z scale_ub_tensor = None 2025-05-07T20:31:49.9752146Z 2025-05-07T20:31:49.9752275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9752369Z op = silu_mul_quant 2025-05-07T20:31:49.9752452Z if compiled: 2025-05-07T20:31:49.9752552Z op = torch.compile(op) 2025-05-07T20:31:49.9752664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9752737Z 2025-05-07T20:31:49.9752826Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9752830Z 2025-05-07T20:31:49.9752933Z moe/activation_test.py:117: 2025-05-07T20:31:49.9753062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9753176Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9753281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9753788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9753886Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9754249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9754471Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9754811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9754902Z kernel = self.compile( 2025-05-07T20:31:49.9755289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9755549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9755675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9755679Z 2025-05-07T20:31:49.9755890Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9756656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9757154Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd66ef9d0>} 2025-05-07T20:31:49.9757901Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9758099Z context = <...> 2025-05-07T20:31:49.9758104Z 2025-05-07T20:31:49.9758272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9758535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9758641Z module_map=module_map) 2025-05-07T20:31:49.9758807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9758907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9758982Z E ^ 2025-05-07T20:31:49.9759343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9759348Z 2025-05-07T20:31:49.9759763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9759768Z 2025-05-07T20:31:49.9759953Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9760175Z self=, 2025-05-07T20:31:49.9760250Z T=2048, 2025-05-07T20:31:49.9760329Z D=5120, 2025-05-07T20:31:49.9760413Z scale_ub=None, 2025-05-07T20:31:49.9760497Z contiguous=True, 2025-05-07T20:31:49.9760583Z compiled=False, 2025-05-07T20:31:49.9760655Z ) 2025-05-07T20:31:49.9760869Z self = 2025-05-07T20:31:49.9761046Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9761051Z 2025-05-07T20:31:49.9761126Z @given( 2025-05-07T20:31:49.9761247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9761344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9761459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9761579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9761704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9761779Z ) 2025-05-07T20:31:49.9762031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9762125Z def test_silu_mul_quant( 2025-05-07T20:31:49.9762205Z self, 2025-05-07T20:31:49.9762281Z T: int, 2025-05-07T20:31:49.9762355Z D: int, 2025-05-07T20:31:49.9762455Z scale_ub: Optional[float], 2025-05-07T20:31:49.9762543Z contiguous: bool, 2025-05-07T20:31:49.9762626Z compiled: bool, 2025-05-07T20:31:49.9762707Z ) -> None: 2025-05-07T20:31:49.9762800Z torch.manual_seed(2025) 2025-05-07T20:31:49.9762871Z 2025-05-07T20:31:49.9763044Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9763118Z 2025-05-07T20:31:49.9763209Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9765100Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9765112Z 2025-05-07T20:31:49.9765231Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9765236Z 2025-05-07T20:31:49.9765342Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9765566Z self=, 2025-05-07T20:31:49.9765648Z T=16384, 2025-05-07T20:31:49.9765724Z D=5120, 2025-05-07T20:31:49.9765806Z scale_ub=None, 2025-05-07T20:31:49.9765892Z contiguous=True, 2025-05-07T20:31:49.9765976Z compiled=False, 2025-05-07T20:31:49.9766059Z ) 2025-05-07T20:31:49.9766277Z self = 2025-05-07T20:31:49.9766451Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9766456Z 2025-05-07T20:31:49.9766532Z @given( 2025-05-07T20:31:49.9766653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9766751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9766865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9766980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9767092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9767170Z ) 2025-05-07T20:31:49.9767413Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9767506Z def test_silu_mul_quant( 2025-05-07T20:31:49.9767583Z self, 2025-05-07T20:31:49.9767660Z T: int, 2025-05-07T20:31:49.9767818Z D: int, 2025-05-07T20:31:49.9767919Z scale_ub: Optional[float], 2025-05-07T20:31:49.9768011Z contiguous: bool, 2025-05-07T20:31:49.9768097Z compiled: bool, 2025-05-07T20:31:49.9768180Z ) -> None: 2025-05-07T20:31:49.9768274Z torch.manual_seed(2025) 2025-05-07T20:31:49.9768349Z 2025-05-07T20:31:49.9768518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9770297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9770310Z 2025-05-07T20:31:49.9770427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9770431Z 2025-05-07T20:31:49.9770531Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9770756Z self=, 2025-05-07T20:31:49.9770832Z T=4096, 2025-05-07T20:31:49.9770907Z D=5120, 2025-05-07T20:31:49.9770990Z scale_ub=None, 2025-05-07T20:31:49.9771072Z contiguous=True, 2025-05-07T20:31:49.9771156Z compiled=False, 2025-05-07T20:31:49.9771233Z ) 2025-05-07T20:31:49.9771451Z self = 2025-05-07T20:31:49.9771624Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9771629Z 2025-05-07T20:31:49.9771704Z @given( 2025-05-07T20:31:49.9771821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9771922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9772117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9772236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9772358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9772430Z ) 2025-05-07T20:31:49.9772677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9772772Z def test_silu_mul_quant( 2025-05-07T20:31:49.9772847Z self, 2025-05-07T20:31:49.9772926Z T: int, 2025-05-07T20:31:49.9773000Z D: int, 2025-05-07T20:31:49.9773096Z scale_ub: Optional[float], 2025-05-07T20:31:49.9773185Z contiguous: bool, 2025-05-07T20:31:49.9773269Z compiled: bool, 2025-05-07T20:31:49.9773347Z ) -> None: 2025-05-07T20:31:49.9773442Z torch.manual_seed(2025) 2025-05-07T20:31:49.9773514Z 2025-05-07T20:31:49.9773680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9775466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
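[Annotation] The requested allocation sizes reported in these OutOfMemoryErrors line up exactly with the test's input tensor: `x = torch.randn([T, 2 * D], dtype=torch.bfloat16)` needs `T * 2D * 2` bytes. A quick arithmetic check against the sizes printed above:

```python
def x_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16);
    # bfloat16 is 2 bytes per element.
    return T * (2 * D) * bytes_per_elem / 2**20


assert x_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
assert x_mib(16384, 5120) == 320.0  # the 320.00 MiB requests
assert x_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
assert x_mib(4096, 5120) == 80.0    # the 80.00 MiB request above
assert x_mib(2048, 7168) == 56.0    # the 56.00 MiB requests
assert x_mib(2048, 5120) == 40.0    # the 40.00 MiB requests
```

So each individual example needs well under 1 GiB for its inputs; the failures occur only because ~21.7 GiB of the 22.07 GiB card is already held when the example starts.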
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9775472Z 2025-05-07T20:31:49.9775587Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9775591Z 2025-05-07T20:31:49.9775694Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9775914Z self=, 2025-05-07T20:31:49.9775991Z T=2048, 2025-05-07T20:31:49.9776065Z D=5120, 2025-05-07T20:31:49.9776146Z scale_ub=None, 2025-05-07T20:31:49.9776316Z contiguous=False, 2025-05-07T20:31:49.9776400Z compiled=False, 2025-05-07T20:31:49.9776472Z ) 2025-05-07T20:31:49.9776690Z self = 2025-05-07T20:31:49.9776862Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9776866Z 2025-05-07T20:31:49.9776940Z @given( 2025-05-07T20:31:49.9777058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9777155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9777271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9777386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9777496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9777575Z ) 2025-05-07T20:31:49.9777823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9777915Z def test_silu_mul_quant( 2025-05-07T20:31:49.9777999Z self, 2025-05-07T20:31:49.9778078Z T: int, 2025-05-07T20:31:49.9778154Z D: int, 2025-05-07T20:31:49.9778254Z scale_ub: Optional[float], 2025-05-07T20:31:49.9778343Z contiguous: bool, 2025-05-07T20:31:49.9778428Z compiled: bool, 2025-05-07T20:31:49.9778510Z ) -> None: 2025-05-07T20:31:49.9778601Z torch.manual_seed(2025) 2025-05-07T20:31:49.9778676Z 2025-05-07T20:31:49.9778844Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9780685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9780699Z 2025-05-07T20:31:49.9780817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9780821Z 2025-05-07T20:31:49.9780921Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9781211Z self=, 2025-05-07T20:31:49.9781289Z T=4096, 2025-05-07T20:31:49.9781363Z D=7168, 2025-05-07T20:31:49.9781449Z scale_ub=None, 2025-05-07T20:31:49.9781532Z contiguous=True, 2025-05-07T20:31:49.9781613Z compiled=True, 2025-05-07T20:31:49.9781689Z ) 2025-05-07T20:31:49.9781904Z self = 2025-05-07T20:31:49.9782074Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9782079Z 2025-05-07T20:31:49.9782154Z @given( 2025-05-07T20:31:49.9782271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9782384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9782499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9782613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9782728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9782800Z ) 2025-05-07T20:31:49.9783050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9783142Z def test_silu_mul_quant( 2025-05-07T20:31:49.9783217Z self, 2025-05-07T20:31:49.9783294Z T: int, 2025-05-07T20:31:49.9783369Z D: int, 2025-05-07T20:31:49.9783467Z scale_ub: Optional[float], 2025-05-07T20:31:49.9783559Z contiguous: bool, 2025-05-07T20:31:49.9783644Z compiled: bool, 2025-05-07T20:31:49.9783722Z ) -> None: 2025-05-07T20:31:49.9783818Z torch.manual_seed(2025) 2025-05-07T20:31:49.9783894Z 2025-05-07T20:31:49.9784085Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9785979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
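[Annotation] The `compiled` parameter in these examples switches the op between eager execution and `torch.compile` inside the test's `fn()`. A standalone sketch of the same toggle; `maybe_compile` is an illustrative name, not FBGEMM code:

```python
from typing import Callable

import torch


def maybe_compile(op: Callable, compiled: bool) -> Callable:
    # torch.compile returns a wrapper that traces and optimizes the op on
    # its first call; with compiled=False the eager op is used unchanged.
    return torch.compile(op) if compiled else op


silu = maybe_compile(torch.nn.functional.silu, compiled=True)
```

This is why `compiled=True` examples can fail in two ways at once here: the wrapper's first call both allocates compile-time workspace and triggers the Triton fp8 codegen path.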
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9785985Z 2025-05-07T20:31:49.9786100Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9786104Z 2025-05-07T20:31:49.9786208Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9786426Z self=, 2025-05-07T20:31:49.9786503Z T=2048, 2025-05-07T20:31:49.9786583Z D=5120, 2025-05-07T20:31:49.9786668Z scale_ub=1200.0, 2025-05-07T20:31:49.9786755Z contiguous=False, 2025-05-07T20:31:49.9786837Z compiled=False, 2025-05-07T20:31:49.9786908Z ) 2025-05-07T20:31:49.9787126Z self = 2025-05-07T20:31:49.9787298Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9787303Z 2025-05-07T20:31:49.9787379Z @given( 2025-05-07T20:31:49.9787497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9787595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9787709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9787824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9787937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9788014Z ) 2025-05-07T20:31:49.9788262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9788439Z def test_silu_mul_quant( 2025-05-07T20:31:49.9788519Z self, 2025-05-07T20:31:49.9788597Z T: int, 2025-05-07T20:31:49.9788675Z D: int, 2025-05-07T20:31:49.9788784Z scale_ub: Optional[float], 2025-05-07T20:31:49.9788871Z contiguous: bool, 2025-05-07T20:31:49.9788955Z compiled: bool, 2025-05-07T20:31:49.9789034Z ) -> None: 2025-05-07T20:31:49.9789128Z torch.manual_seed(2025) 2025-05-07T20:31:49.9789202Z 2025-05-07T20:31:49.9789368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9791144Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9791158Z 2025-05-07T20:31:49.9791275Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9791280Z 2025-05-07T20:31:49.9791380Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9791601Z self=, 2025-05-07T20:31:49.9791678Z T=4096, 2025-05-07T20:31:49.9791754Z D=7168, 2025-05-07T20:31:49.9791837Z scale_ub=1200.0, 2025-05-07T20:31:49.9791920Z contiguous=True, 2025-05-07T20:31:49.9792002Z compiled=False, 2025-05-07T20:31:49.9792076Z ) 2025-05-07T20:31:49.9792290Z self = 2025-05-07T20:31:49.9792466Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9792471Z 2025-05-07T20:31:49.9792702Z @given( 2025-05-07T20:31:49.9792829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9792929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9793042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9793156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9793270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9793343Z ) 2025-05-07T20:31:49.9793597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9793690Z def test_silu_mul_quant( 2025-05-07T20:31:49.9793765Z self, 2025-05-07T20:31:49.9793848Z T: int, 2025-05-07T20:31:49.9793923Z D: int, 2025-05-07T20:31:49.9794019Z scale_ub: Optional[float], 2025-05-07T20:31:49.9794113Z contiguous: bool, 2025-05-07T20:31:49.9794197Z compiled: bool, 2025-05-07T20:31:49.9794275Z ) -> None: 2025-05-07T20:31:49.9794373Z torch.manual_seed(2025) 2025-05-07T20:31:49.9794457Z 2025-05-07T20:31:49.9794623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9796373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9796379Z 2025-05-07T20:31:49.9796494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9796499Z 2025-05-07T20:31:49.9796601Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9796825Z self=, 2025-05-07T20:31:49.9796987Z T=16384, 2025-05-07T20:31:49.9797065Z D=7168, 2025-05-07T20:31:49.9797145Z scale_ub=None, 2025-05-07T20:31:49.9797234Z contiguous=False, 2025-05-07T20:31:49.9797317Z compiled=True, 2025-05-07T20:31:49.9797390Z ) 2025-05-07T20:31:49.9797606Z self = 2025-05-07T20:31:49.9797780Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9797785Z 2025-05-07T20:31:49.9797860Z @given( 2025-05-07T20:31:49.9797977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9798073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9798189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9798305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9798417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9798492Z ) 2025-05-07T20:31:49.9798744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9798843Z def test_silu_mul_quant( 2025-05-07T20:31:49.9798922Z self, 2025-05-07T20:31:49.9798998Z T: int, 2025-05-07T20:31:49.9799074Z D: int, 2025-05-07T20:31:49.9799176Z scale_ub: Optional[float], 2025-05-07T20:31:49.9799263Z contiguous: bool, 2025-05-07T20:31:49.9799347Z compiled: bool, 2025-05-07T20:31:49.9799428Z ) -> None: 2025-05-07T20:31:49.9799521Z torch.manual_seed(2025) 2025-05-07T20:31:49.9799594Z 2025-05-07T20:31:49.9799758Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9801540Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
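[Annotation] The repeated "Trying example:" blocks are Hypothesis drawing parameters from fixed grids via `st.sampled_from` under `@settings(verbosity=Verbosity.verbose, ...)`, which is why the full test source is echoed for every draw. A self-contained sketch of the same pattern; `_MAX_SAMPLES = 20` is assumed here, since the constant's value never appears in the log:

```python
import hypothesis.strategies as st
from hypothesis import Verbosity, given, settings

_MAX_SAMPLES = 20  # assumed value; not shown in the log


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_shapes(T: int, D: int) -> None:
    # Hypothesis runs this body once per drawn example; verbose verbosity
    # prints "Trying example: ..." before each call, as seen above.
    assert T * D > 0
```

Because state persists on the GPU across draws, an early leak makes every subsequent large-`T` example fail, which matches the cascade of OOMs in this run.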
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9801627Z 2025-05-07T20:31:49.9801748Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9801752Z 2025-05-07T20:31:49.9801852Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9802077Z self=, 2025-05-07T20:31:49.9802154Z T=4096, 2025-05-07T20:31:49.9802229Z D=7168, 2025-05-07T20:31:49.9802312Z scale_ub=None, 2025-05-07T20:31:49.9802396Z contiguous=True, 2025-05-07T20:31:49.9802479Z compiled=False, 2025-05-07T20:31:49.9802555Z ) 2025-05-07T20:31:49.9802770Z self = 2025-05-07T20:31:49.9802947Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9802965Z 2025-05-07T20:31:49.9803041Z @given( 2025-05-07T20:31:49.9803159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9803258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9803370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9803484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9803597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9803670Z ) 2025-05-07T20:31:49.9803922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9804014Z def test_silu_mul_quant( 2025-05-07T20:31:49.9804091Z self, 2025-05-07T20:31:49.9804187Z T: int, 2025-05-07T20:31:49.9804269Z D: int, 2025-05-07T20:31:49.9804385Z scale_ub: Optional[float], 2025-05-07T20:31:49.9804483Z contiguous: bool, 2025-05-07T20:31:49.9804571Z compiled: bool, 2025-05-07T20:31:49.9804653Z ) -> None: 2025-05-07T20:31:49.9804823Z torch.manual_seed(2025) 2025-05-07T20:31:49.9804897Z 2025-05-07T20:31:49.9805063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9806814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9806819Z 2025-05-07T20:31:49.9806937Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9806945Z 2025-05-07T20:31:49.9807044Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9807275Z self=, 2025-05-07T20:31:49.9807354Z T=16384, 2025-05-07T20:31:49.9807429Z D=7168, 2025-05-07T20:31:49.9807509Z scale_ub=None, 2025-05-07T20:31:49.9807595Z contiguous=True, 2025-05-07T20:31:49.9807678Z compiled=False, 2025-05-07T20:31:49.9807749Z ) 2025-05-07T20:31:49.9807970Z self = 2025-05-07T20:31:49.9808147Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9808152Z 2025-05-07T20:31:49.9808227Z @given( 2025-05-07T20:31:49.9808348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9808445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9808559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9808673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9808785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9808942Z ) 2025-05-07T20:31:49.9809191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9809283Z def test_silu_mul_quant( 2025-05-07T20:31:49.9809362Z self, 2025-05-07T20:31:49.9809437Z T: int, 2025-05-07T20:31:49.9809512Z D: int, 2025-05-07T20:31:49.9809610Z scale_ub: Optional[float], 2025-05-07T20:31:49.9809696Z contiguous: bool, 2025-05-07T20:31:49.9809780Z compiled: bool, 2025-05-07T20:31:49.9809860Z ) -> None: 2025-05-07T20:31:49.9809953Z torch.manual_seed(2025) 2025-05-07T20:31:49.9810029Z 2025-05-07T20:31:49.9810197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9811982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9811997Z 2025-05-07T20:31:49.9812112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9812117Z 2025-05-07T20:31:49.9812217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9812441Z self=, 2025-05-07T20:31:49.9812517Z T=16384, 2025-05-07T20:31:49.9812591Z D=7168, 2025-05-07T20:31:49.9812678Z scale_ub=1200.0, 2025-05-07T20:31:49.9812764Z contiguous=True, 2025-05-07T20:31:49.9812846Z compiled=False, 2025-05-07T20:31:49.9812922Z ) 2025-05-07T20:31:49.9813135Z self = 2025-05-07T20:31:49.9813425Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9813430Z 2025-05-07T20:31:49.9813505Z @given( 2025-05-07T20:31:49.9813622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9813724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9813836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9813950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9814065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9814148Z ) 2025-05-07T20:31:49.9814442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9814536Z def test_silu_mul_quant( 2025-05-07T20:31:49.9814612Z self, 2025-05-07T20:31:49.9814689Z T: int, 2025-05-07T20:31:49.9814763Z D: int, 2025-05-07T20:31:49.9814859Z scale_ub: Optional[float], 2025-05-07T20:31:49.9814948Z contiguous: bool, 2025-05-07T20:31:49.9815046Z compiled: bool, 2025-05-07T20:31:49.9815122Z ) -> None: 2025-05-07T20:31:49.9815218Z torch.manual_seed(2025) 2025-05-07T20:31:49.9815290Z 2025-05-07T20:31:49.9815458Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9817240Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
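The allocation sizes in these OOM reports line up exactly with the test's first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16): T * 2D bfloat16 elements at 2 bytes each. A quick sanity check in plain Python, using the T/D values from the failing examples above:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16) in MiB
    def bf16_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # element count times 2 bytes per bfloat16

    print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"

So the failures are not allocator mystery overhead: the device (22.07 GiB total, ~30 MiB free at this point) simply has no room left for the example's input tensor.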
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9817245Z 2025-05-07T20:31:49.9817362Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9817449Z 2025-05-07T20:31:49.9817556Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9817775Z self=, 2025-05-07T20:31:49.9817853Z T=128, 2025-05-07T20:31:49.9817929Z D=5120, 2025-05-07T20:31:49.9818012Z scale_ub=1200.0, 2025-05-07T20:31:49.9818099Z contiguous=False, 2025-05-07T20:31:49.9818181Z compiled=False, 2025-05-07T20:31:49.9818252Z ) 2025-05-07T20:31:49.9818475Z self = 2025-05-07T20:31:49.9818647Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9818652Z 2025-05-07T20:31:49.9818727Z @given( 2025-05-07T20:31:49.9818848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9818946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9819066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9819181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9819302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9819378Z ) 2025-05-07T20:31:49.9819621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9819715Z def test_silu_mul_quant( 2025-05-07T20:31:49.9819795Z self, 2025-05-07T20:31:49.9819869Z T: int, 2025-05-07T20:31:49.9819943Z D: int, 2025-05-07T20:31:49.9820046Z scale_ub: Optional[float], 2025-05-07T20:31:49.9820134Z contiguous: bool, 2025-05-07T20:31:49.9820219Z compiled: bool, 2025-05-07T20:31:49.9820297Z ) -> None: 2025-05-07T20:31:49.9820389Z torch.manual_seed(2025) 2025-05-07T20:31:49.9820465Z 2025-05-07T20:31:49.9820630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9820703Z 2025-05-07T20:31:49.9820796Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9820921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9821091Z x = x_sign * x_clamp 2025-05-07T20:31:49.9821225Z x0 = x[:, :D] 2025-05-07T20:31:49.9821305Z x1 = x[:, D:] 2025-05-07T20:31:49.9821377Z 2025-05-07T20:31:49.9821462Z if contiguous: 2025-05-07T20:31:49.9821554Z x0 = x0.contiguous() 2025-05-07T20:31:49.9821643Z x1 = x1.contiguous() 2025-05-07T20:31:49.9821719Z 2025-05-07T20:31:49.9821807Z if scale_ub is not None: 2025-05-07T20:31:49.9821914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9822050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9822125Z ) 2025-05-07T20:31:49.9822204Z else: 2025-05-07T20:31:49.9822297Z scale_ub_tensor = None 2025-05-07T20:31:49.9822368Z 2025-05-07T20:31:49.9822502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9822592Z op = silu_mul_quant 2025-05-07T20:31:49.9822678Z if compiled: 2025-05-07T20:31:49.9822795Z op = torch.compile(op) 2025-05-07T20:31:49.9822899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9822970Z 2025-05-07T20:31:49.9823061Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9823066Z 2025-05-07T20:31:49.9823162Z moe/activation_test.py:117: 2025-05-07T20:31:49.9823291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9823391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9823490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9823992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9824087Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9824443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9824670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9825101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9825196Z kernel = self.compile( 2025-05-07T20:31:49.9825575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9825751Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9825879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9825883Z 2025-05-07T20:31:49.9826091Z self = 2025-05-07T20:31:49.9826871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9827377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd65c6670>} 2025-05-07T20:31:49.9828130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9828324Z context = 2025-05-07T20:31:49.9828329Z 2025-05-07T20:31:49.9828493Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9828767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9828874Z module_map=module_map) 2025-05-07T20:31:49.9829037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9829138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9829213Z E ^ 2025-05-07T20:31:49.9829648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9829656Z 2025-05-07T20:31:49.9830073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9830077Z 2025-05-07T20:31:49.9830181Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9830406Z self=, 2025-05-07T20:31:49.9830486Z T=2048, 2025-05-07T20:31:49.9830562Z D=7168, 2025-05-07T20:31:49.9830647Z scale_ub=None, 2025-05-07T20:31:49.9830733Z contiguous=False, 2025-05-07T20:31:49.9830814Z compiled=False, 2025-05-07T20:31:49.9830888Z ) 2025-05-07T20:31:49.9831105Z self = 2025-05-07T20:31:49.9831285Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9831294Z 2025-05-07T20:31:49.9831374Z @given( 2025-05-07T20:31:49.9831493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9831596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9831709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9831822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9831939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9832011Z ) 2025-05-07T20:31:49.9832263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9832355Z def test_silu_mul_quant( 2025-05-07T20:31:49.9832431Z self, 2025-05-07T20:31:49.9832511Z T: int, 2025-05-07T20:31:49.9832586Z D: int, 2025-05-07T20:31:49.9832681Z scale_ub: Optional[float], 2025-05-07T20:31:49.9832773Z contiguous: bool, 2025-05-07T20:31:49.9832857Z compiled: bool, 2025-05-07T20:31:49.9832933Z ) -> None: 2025-05-07T20:31:49.9833035Z torch.manual_seed(2025) 2025-05-07T20:31:49.9833208Z 2025-05-07T20:31:49.9833377Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9835128Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
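Each of these OOM messages suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. A minimal sketch of applying that inside a test process, assuming the variable is set before the first CUDA allocation (once the allocator has initialized, the setting is too late):

    import os
    # Configure the caching allocator before torch touches CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Equivalently, the variable can be exported in the job environment before invoking pytest. Note, though, that in this run the reserved-but-unallocated figure is only a few MiB, so fragmentation is unlikely to be the real culprit here; the device is simply full.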
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9835134Z 2025-05-07T20:31:49.9835251Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9835258Z 2025-05-07T20:31:49.9835358Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9835587Z self=, 2025-05-07T20:31:49.9835676Z T=128, 2025-05-07T20:31:49.9835753Z D=7168, 2025-05-07T20:31:49.9835836Z scale_ub=1200.0, 2025-05-07T20:31:49.9835927Z contiguous=True, 2025-05-07T20:31:49.9836008Z compiled=True, 2025-05-07T20:31:49.9836080Z ) 2025-05-07T20:31:49.9836299Z self = 2025-05-07T20:31:49.9836466Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9836471Z 2025-05-07T20:31:49.9836545Z @given( 2025-05-07T20:31:49.9836663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9836761Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9836875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9836991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9837102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9837180Z ) 2025-05-07T20:31:49.9837511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9837606Z def test_silu_mul_quant( 2025-05-07T20:31:49.9837687Z self, 2025-05-07T20:31:49.9837764Z T: int, 2025-05-07T20:31:49.9837838Z D: int, 2025-05-07T20:31:49.9837939Z scale_ub: Optional[float], 2025-05-07T20:31:49.9838026Z contiguous: bool, 2025-05-07T20:31:49.9838112Z compiled: bool, 2025-05-07T20:31:49.9838190Z ) -> None: 2025-05-07T20:31:49.9838282Z torch.manual_seed(2025) 2025-05-07T20:31:49.9838356Z 2025-05-07T20:31:49.9838524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9838599Z 2025-05-07T20:31:49.9838692Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9838817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9838905Z x = x_sign * x_clamp 2025-05-07T20:31:49.9838987Z x0 = x[:, :D] 2025-05-07T20:31:49.9839076Z x1 = x[:, D:] 2025-05-07T20:31:49.9839147Z 2025-05-07T20:31:49.9839234Z if contiguous: 2025-05-07T20:31:49.9839325Z x0 = x0.contiguous() 2025-05-07T20:31:49.9839414Z x1 = x1.contiguous() 2025-05-07T20:31:49.9839489Z 2025-05-07T20:31:49.9839578Z if scale_ub is not None: 2025-05-07T20:31:49.9839684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9839818Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9839892Z ) 2025-05-07T20:31:49.9839970Z else: 2025-05-07T20:31:49.9840196Z scale_ub_tensor = None 2025-05-07T20:31:49.9840270Z 2025-05-07T20:31:49.9840405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9840493Z op = silu_mul_quant 2025-05-07T20:31:49.9840578Z if compiled: 2025-05-07T20:31:49.9840680Z op = torch.compile(op) 2025-05-07T20:31:49.9840790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9841018Z 2025-05-07T20:31:49.9841113Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9841118Z 2025-05-07T20:31:49.9841214Z moe/activation_test.py:117: 2025-05-07T20:31:49.9841346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9841449Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9841553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9846476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9846588Z return fn(*args, **kwargs) 2025-05-07T20:31:49.9847094Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9847192Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9847557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9847793Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9848136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9848231Z kernel = self.compile( 2025-05-07T20:31:49.9848609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9848784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9848918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9848924Z 2025-05-07T20:31:49.9849135Z self = 2025-05-07T20:31:49.9849905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9850552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd659b5e0>} 2025-05-07T20:31:49.9851305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9851495Z context = 2025-05-07T20:31:49.9851500Z 2025-05-07T20:31:49.9851666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9851930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9852036Z module_map=module_map) 2025-05-07T20:31:49.9852197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9852302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9852389Z E ^ 2025-05-07T20:31:49.9852746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9852751Z 2025-05-07T20:31:49.9853158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9853162Z 2025-05-07T20:31:49.9853263Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9853489Z self=, 2025-05-07T20:31:49.9853564Z T=128, 2025-05-07T20:31:49.9853641Z D=7168, 2025-05-07T20:31:49.9853726Z scale_ub=1200.0, 2025-05-07T20:31:49.9853808Z contiguous=True, 2025-05-07T20:31:49.9853892Z compiled=False, 2025-05-07T20:31:49.9853963Z ) 2025-05-07T20:31:49.9854178Z self = 2025-05-07T20:31:49.9854359Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9854443Z 2025-05-07T20:31:49.9854520Z @given( 2025-05-07T20:31:49.9854637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9854738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9854852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9854968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9855080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9855153Z ) 2025-05-07T20:31:49.9855404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9855497Z def test_silu_mul_quant( 2025-05-07T20:31:49.9855572Z self, 2025-05-07T20:31:49.9855655Z T: int, 2025-05-07T20:31:49.9855731Z D: int, 2025-05-07T20:31:49.9855830Z scale_ub: Optional[float], 2025-05-07T20:31:49.9855925Z contiguous: bool, 2025-05-07T20:31:49.9856009Z compiled: bool, 2025-05-07T20:31:49.9856097Z ) -> None: 2025-05-07T20:31:49.9856192Z torch.manual_seed(2025) 2025-05-07T20:31:49.9856268Z 2025-05-07T20:31:49.9856442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9856516Z 2025-05-07T20:31:49.9856607Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9856735Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9858482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
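By this point the process already holds 22.05 GiB of a 22.07 GiB device, so even the 20 MiB intermediate produced by torch.clamp(torch.abs(x), 0.01, 2.0) fails: memory from earlier examples is evidently still held across Hypothesis examples. A hedged sketch of a cleanup step that could run between examples — where to hook it into this test class is an assumption on my part, not shown in the log:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return cached blocks
        # to the driver so the next example starts from a cleaner allocator
        # state. This frees cached memory only, not tensors still referenced.
        gc.collect()
        torch.cuda.empty_cache()
        mib = torch.cuda.memory_allocated() / 2**20
        print(f"still allocated after cleanup: {mib:.1f} MiB")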
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9858493Z 2025-05-07T20:31:49.9858688Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9858693Z 2025-05-07T20:31:49.9858794Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9859016Z self=, 2025-05-07T20:31:49.9859098Z T=128, 2025-05-07T20:31:49.9859173Z D=5120, 2025-05-07T20:31:49.9859253Z scale_ub=1200.0, 2025-05-07T20:31:49.9859344Z contiguous=True, 2025-05-07T20:31:49.9859425Z compiled=True, 2025-05-07T20:31:49.9859500Z ) 2025-05-07T20:31:49.9859716Z self = 2025-05-07T20:31:49.9859883Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9859887Z 2025-05-07T20:31:49.9859965Z @given( 2025-05-07T20:31:49.9860082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9860179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9860300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9860420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9860534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9860610Z ) 2025-05-07T20:31:49.9860853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9860947Z def test_silu_mul_quant( 2025-05-07T20:31:49.9861022Z self, 2025-05-07T20:31:49.9861097Z T: int, 2025-05-07T20:31:49.9861239Z D: int, 2025-05-07T20:31:49.9861338Z scale_ub: Optional[float], 2025-05-07T20:31:49.9861425Z contiguous: bool, 2025-05-07T20:31:49.9861512Z compiled: bool, 2025-05-07T20:31:49.9861588Z ) -> None: 2025-05-07T20:31:49.9861681Z torch.manual_seed(2025) 2025-05-07T20:31:49.9861757Z 2025-05-07T20:31:49.9861923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9861997Z 2025-05-07T20:31:49.9862095Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9863946Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9863952Z 2025-05-07T20:31:49.9864074Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9864079Z 2025-05-07T20:31:49.9864179Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9864403Z self=, 2025-05-07T20:31:49.9864480Z T=128, 2025-05-07T20:31:49.9864554Z D=7168, 2025-05-07T20:31:49.9864645Z scale_ub=None, 2025-05-07T20:31:49.9864729Z contiguous=True, 2025-05-07T20:31:49.9864811Z compiled=True, 2025-05-07T20:31:49.9864885Z ) 2025-05-07T20:31:49.9865099Z self = 2025-05-07T20:31:49.9865264Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9865272Z 2025-05-07T20:31:49.9865349Z @given( 2025-05-07T20:31:49.9865467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9865567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9865679Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9865793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9865906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9865980Z ) 2025-05-07T20:31:49.9866223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9866319Z def test_silu_mul_quant( 2025-05-07T20:31:49.9866476Z self, 2025-05-07T20:31:49.9866557Z T: int, 2025-05-07T20:31:49.9866635Z D: int, 2025-05-07T20:31:49.9866730Z scale_ub: Optional[float], 2025-05-07T20:31:49.9866820Z contiguous: bool, 2025-05-07T20:31:49.9866906Z compiled: bool, 2025-05-07T20:31:49.9866983Z ) -> None: 2025-05-07T20:31:49.9867080Z torch.manual_seed(2025) 2025-05-07T20:31:49.9867153Z 2025-05-07T20:31:49.9867319Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9869061Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9869071Z 2025-05-07T20:31:49.9869189Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9869325Z =============================== warnings summary =============================== 2025-05-07T20:31:49.9869636Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9869932Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9870228Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9871108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:49.9871417Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:49.9871422Z 2025-05-07T20:31:49.9871602Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:49.9872888Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:49.9873078Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:49.9873082Z 2025-05-07T20:31:49.9873292Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:49.9873459Z ================== 1 failed, 1 passed, 13 warnings in 33.14s =================== 2025-05-07T20:31:51.7274847Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:51.7914319Z 2025-05-07T20:31:51.7915251Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:51.7915640Z 2025-05-07T20:31:51.7915645Z 2025-05-07T20:31:51.7934714Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:53.9547036Z ============================= test session starts ============================== 2025-05-07T20:31:53.9547701Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:53.9548236Z cachedir: .pytest_cache 2025-05-07T20:31:53.9549177Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:53.9549946Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:53.9550369Z plugins: hypothesis-6.131.14 2025-05-07T20:31:55.5644852Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:55.7770395Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:55.7770807Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:55.7771043Z 2025-05-07T20:31:57.9873717Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.9874804Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.9876198Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.9877656Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.9879049Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.9880437Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.9881756Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.9883427Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.9884857Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.9886120Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.9887360Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.9888599Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.9889643Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.9890684Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.9891921Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.9893224Z W0507 20:31:57.985997 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.9894522Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.9895583Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.9896788Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.9898157Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.9899236Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.9900151Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.9900908Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.9902043Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.0046018Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.0047156Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:58.0048528Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.0050456Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.0052035Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.0053435Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.0054764Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.0056163Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.0057639Z W0507 20:31:58.004048 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.0058891Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:58.0060120Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.0061667Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:58.0062718Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:58.0063742Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:58.0064979Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.0066258Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.0067389Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:58.0068448Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:58.0069615Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.0070982Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.0072055Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.0072983Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.0073819Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:58.0074865Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.6534342Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.6535302Z self=, 2025-05-07T20:31:58.6535907Z T=1, 2025-05-07T20:31:58.6536104Z D=5120, 2025-05-07T20:31:58.6536318Z scale_ub=None, 2025-05-07T20:31:58.6536546Z contiguous=True, 2025-05-07T20:31:58.6536777Z compiled=True, 2025-05-07T20:31:58.6537000Z ) 2025-05-07T20:31:58.6537338Z self = 2025-05-07T20:31:58.6537871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.6538141Z 2025-05-07T20:31:58.6538224Z @given( 2025-05-07T20:31:58.6538470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.6538794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.6539106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.6539446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.6539788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.6540468Z ) 2025-05-07T20:31:58.6540923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.6541486Z def test_silu_mul_quant( 2025-05-07T20:31:58.6541742Z self, 2025-05-07T20:31:58.6541940Z T: int, 2025-05-07T20:31:58.6542147Z D: int, 2025-05-07T20:31:58.6542379Z scale_ub: Optional[float], 2025-05-07T20:31:58.6542656Z contiguous: bool, 2025-05-07T20:31:58.6543330Z compiled: bool, 2025-05-07T20:31:58.6543579Z ) -> None: 2025-05-07T20:31:58.6543809Z torch.manual_seed(2025) 2025-05-07T20:31:58.6544060Z 2025-05-07T20:31:58.6544347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.6544702Z 2025-05-07T20:31:58.6544900Z x_sign = torch.sign(x) 2025-05-07T20:31:58.6545203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.6545527Z x = x_sign * x_clamp 2025-05-07T20:31:58.6545775Z x0 = x[:, :D] 2025-05-07T20:31:58.6546004Z x1 = x[:, D:] 2025-05-07T20:31:58.6546223Z 2025-05-07T20:31:58.6546416Z if contiguous: 2025-05-07T20:31:58.6546671Z x0 = x0.contiguous() 2025-05-07T20:31:58.6546982Z x1 = x1.contiguous() 2025-05-07T20:31:58.6547229Z 2025-05-07T20:31:58.6547438Z if scale_ub is not None: 2025-05-07T20:31:58.6547728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.6548089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.6548407Z ) 2025-05-07T20:31:58.6548611Z else: 2025-05-07T20:31:58.6548839Z scale_ub_tensor = None 2025-05-07T20:31:58.6549097Z 2025-05-07T20:31:58.6549345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.6549674Z op = silu_mul_quant 2025-05-07T20:31:58.6549940Z if compiled: 2025-05-07T20:31:58.6550209Z op = torch.compile(op) 2025-05-07T20:31:58.6550523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.6550808Z 2025-05-07T20:31:58.6551013Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.6551314Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.6551610Z 2025-05-07T20:31:58.6551860Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.6552210Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.6552509Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.6553016Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.6553392Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.6553716Z 2025-05-07T20:31:58.6553923Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.6554128Z 2025-05-07T20:31:58.6554234Z moe/activation_test.py:126: 2025-05-07T20:31:58.6554547Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.6554890Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.6555229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.6556030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.6556805Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.6557358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.6558061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.6558762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.6559494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.6560258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.6561014Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.6561746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.6562389Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.6563001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.6563625Z fn() 2025-05-07T20:31:58.6564141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.6564724Z self.fn.run( 2025-05-07T20:31:58.6565199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.6565745Z kernel = self.compile( 2025-05-07T20:31:58.6566290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.6566990Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.6567411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.6567654Z 2025-05-07T20:31:58.6567874Z self = 2025-05-07T20:31:58.6568964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.6570370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7edee7040>} 2025-05-07T20:31:58.6571726Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.6572754Z context = 2025-05-07T20:31:58.6573048Z 2025-05-07T20:31:58.6573238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.6573763Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.6574254Z module_map=module_map) 2025-05-07T20:31:58.6574728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.6575097Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.6575383Z E ^ 2025-05-07T20:31:58.6575857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.6576309Z 2025-05-07T20:31:58.6576749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.6577304Z 2025-05-07T20:31:58.6577413Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.6577847Z self=, 2025-05-07T20:31:58.6578261Z T=2048, 2025-05-07T20:31:58.6578454Z D=5120, 2025-05-07T20:31:58.6578658Z scale_ub=1200.0, 2025-05-07T20:31:58.6578894Z contiguous=True, 2025-05-07T20:31:58.6579125Z compiled=False, 2025-05-07T20:31:58.6579343Z ) 2025-05-07T20:31:59.7052647Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.7053757Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.7055100Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.7056562Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.7058267Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.7059671Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.7060988Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.7062471Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.7063906Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.7065175Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.7066412Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.7067627Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.7068682Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.7069713Z W0507 20:31:59.700780 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.7071109Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.7072407Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.7073538Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.7074598Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.7075791Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.7077176Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.7078249Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.7079182Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.7079946Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.7080991Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.9382561Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.9383674Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.9385035Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.9386494Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.9387903Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.9389317Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.9390654Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.9392059Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.9393501Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.9394933Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.9396177Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.9397398Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.9398464Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.9399510Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.9400764Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.9402083Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.9403229Z W0507 
20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.9404304Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.9405507Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.9406981Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.9408075Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.9409001Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.9409768Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.9410806Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7832019Z self = 2025-05-07T20:32:00.7832914Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.7833332Z 2025-05-07T20:32:00.7833443Z @given( 2025-05-07T20:32:00.7833768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.7834199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.7834539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.7834897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.7835241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.7835549Z ) 2025-05-07T20:32:00.7835923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.7836374Z def test_silu_mul_quant( 2025-05-07T20:32:00.7836633Z self, 2025-05-07T20:32:00.7836844Z T: int, 2025-05-07T20:32:00.7837050Z D: int, 2025-05-07T20:32:00.7837288Z scale_ub: Optional[float], 2025-05-07T20:32:00.7837606Z contiguous: bool, 2025-05-07T20:32:00.7837883Z compiled: bool, 2025-05-07T20:32:00.7838557Z ) -> None: 2025-05-07T20:32:00.7838789Z torch.manual_seed(2025) 2025-05-07T20:32:00.7839051Z 2025-05-07T20:32:00.7839332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.7839690Z 2025-05-07T20:32:00.7839900Z x_sign = torch.sign(x) 2025-05-07T20:32:00.7840396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.7840727Z x = x_sign * x_clamp 2025-05-07T20:32:00.7840983Z x0 = x[:, :D] 2025-05-07T20:32:00.7841205Z x1 = x[:, D:] 2025-05-07T20:32:00.7841427Z 2025-05-07T20:32:00.7841629Z if contiguous: 2025-05-07T20:32:00.7841867Z x0 = x0.contiguous() 2025-05-07T20:32:00.7842141Z x1 = x1.contiguous() 2025-05-07T20:32:00.7842394Z 2025-05-07T20:32:00.7842591Z if scale_ub is not None: 2025-05-07T20:32:00.7842878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.7843234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.7843568Z ) 2025-05-07T20:32:00.7843768Z else: 2025-05-07T20:32:00.7843993Z scale_ub_tensor = None 
2025-05-07T20:32:00.7844260Z 2025-05-07T20:32:00.7844499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7844845Z op = silu_mul_quant 2025-05-07T20:32:00.7845119Z if compiled: 2025-05-07T20:32:00.7845385Z op = torch.compile(op) 2025-05-07T20:32:00.7845691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7855756Z 2025-05-07T20:32:00.7855988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.7856180Z 2025-05-07T20:32:00.7856288Z moe/activation_test.py:117: 2025-05-07T20:32:00.7856606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7856948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.7857246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7858158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.7858868Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.7859430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7860129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7860809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7861433Z kernel = self.compile( 2025-05-07T20:32:00.7861995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7862661Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7863068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7863319Z 2025-05-07T20:32:00.7863540Z self = 2025-05-07T20:32:00.7864632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7866113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7cbe6e5e0>} 2025-05-07T20:32:00.7867489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7868532Z context = 2025-05-07T20:32:00.7868827Z 2025-05-07T20:32:00.7869006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7869676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7870161Z module_map=module_map) 2025-05-07T20:32:00.7870535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7870910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.7871186Z E ^ 2025-05-07T20:32:00.7871660Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

(test body as in the listing above; under compiled=True the failure surfaces in the reference path instead)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ..., backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
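Note: ref_fn only reaches Triton in its last step, the row-wise FP8 quantization. For orientation,
a plain-PyTorch sketch of per-row quantization consistent with how the test dequantizes
(y_fp8.to(torch.float32) * y_scale[:, None]); FP8_MAX, EPS, and the clamping details are
assumptions, not the actual triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # assumption: torch.finfo(torch.float8_e4m3fn).max
    EPS = 1e-12      # assumption: guards all-zero rows against divide-by-zero

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally clamped to the provided upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        # One scale per row so each row spans the FP8 range; dequantization
        # multiplies the scale back, matching the test's check above.
        y_scale = torch.clamp(row_max, min=EPS) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale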
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

[2025-05-07T20:32:01] W0507 20:32:01.406810 and W0507 20:32:01.581908 [0/2]: the
identify_mutated_tensors warning and traceback repeat verbatim twice more, each ending in the same
_fbgemm_silu_mul_quant CompilationError.

[2025-05-07T20:32:02] The T=16384 example then fails at "y_fp8, y_scale = fn()" with the same
_fbgemm_silu_mul_quant CompilationError; the pytest dump (test body and traceback) is identical to
the first failing example above.
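Note: every "Trying example" block re-prints the whole test body because the suite runs Hypothesis
at Verbosity.verbose (visible in the @settings line above), which echoes each drawn example before
executing it. A minimal sketch of that behavior, with a placeholder _MAX_SAMPLES (the real constant
is defined in activation_test.py):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 16  # placeholder value; activation_test.py defines its own

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_demo(T: int) -> None:
        # At Verbosity.verbose, Hypothesis prints "Trying example: test_demo(T=...)"
        # for each draw; that is what fills this section of the log.
        assert T >= 1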
The remaining drawn examples fail the same way; only the parameters and the failing kernel differ:

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> [20:32:03-20:32:04] two more identify_mutated_tensors warnings ([0/3], W0507 20:32:03.407715
       and W0507 20:32:04.080095) repeat the traceback above; then fails in fn() ->
       _fbgemm_silu_mul_quant, same CompilationError
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> _fbgemm_silu_mul_quant, same CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> _fbgemm_silu_mul_quant:
       E   triton.compiler.errors.CompilationError: at 1:0:
       E   def _fbgemm_silu_mul_quant(
       E   ^
       E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8851261Z 2025-05-07T20:32:05.8851810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8852332Z 2025-05-07T20:32:05.8852439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8852863Z self=, 2025-05-07T20:32:05.8853270Z T=4096, 2025-05-07T20:32:05.8853462Z D=5120, 2025-05-07T20:32:05.8853663Z scale_ub=1200.0, 2025-05-07T20:32:05.8853899Z contiguous=True, 2025-05-07T20:32:05.8854122Z compiled=False, 2025-05-07T20:32:05.8854338Z ) 2025-05-07T20:32:05.8854670Z self = 2025-05-07T20:32:05.8855177Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8855457Z 2025-05-07T20:32:05.8855539Z @given( 2025-05-07T20:32:05.8855777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8856105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8856427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8856771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8857107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8857396Z ) 2025-05-07T20:32:05.8857749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8858197Z def test_silu_mul_quant( 2025-05-07T20:32:05.8858449Z self, 2025-05-07T20:32:05.8858643Z T: int, 2025-05-07T20:32:05.8858844Z D: int, 2025-05-07T20:32:05.8859071Z scale_ub: Optional[float], 2025-05-07T20:32:05.8859344Z contiguous: bool, 2025-05-07T20:32:05.8859596Z compiled: bool, 2025-05-07T20:32:05.8859878Z ) -> None: 2025-05-07T20:32:05.8860106Z torch.manual_seed(2025) 2025-05-07T20:32:05.8860364Z 2025-05-07T20:32:05.8860646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8860989Z 2025-05-07T20:32:05.8861350Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8861651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8861968Z x = x_sign * x_clamp 2025-05-07T20:32:05.8862221Z x0 = x[:, :D] 2025-05-07T20:32:05.8862450Z x1 = x[:, D:] 2025-05-07T20:32:05.8862657Z 2025-05-07T20:32:05.8862848Z if contiguous: 2025-05-07T20:32:05.8863099Z x0 = x0.contiguous() 2025-05-07T20:32:05.8863354Z x1 = x1.contiguous() 2025-05-07T20:32:05.8863609Z 2025-05-07T20:32:05.8863808Z if scale_ub is not None: 2025-05-07T20:32:05.8864089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8864426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8864743Z ) 2025-05-07T20:32:05.8864941Z else: 2025-05-07T20:32:05.8865155Z scale_ub_tensor = None 2025-05-07T20:32:05.8865416Z 2025-05-07T20:32:05.8865648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8865982Z op = silu_mul_quant 2025-05-07T20:32:05.8866245Z if compiled: 2025-05-07T20:32:05.8866508Z op = torch.compile(op) 2025-05-07T20:32:05.8866812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8867099Z 2025-05-07T20:32:05.8867297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8867469Z 2025-05-07T20:32:05.8867572Z moe/activation_test.py:117: 2025-05-07T20:32:05.8867875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8868217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8868507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8869205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8869948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8870582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8871282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8871952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8872488Z kernel = self.compile( 2025-05-07T20:32:05.8873041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8873694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8874098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8874330Z 2025-05-07T20:32:05.8874545Z self = 2025-05-07T20:32:05.8875636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8877009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6430>} 2025-05-07T20:32:05.8878367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8879384Z context = 2025-05-07T20:32:05.8879674Z 2025-05-07T20:32:05.8879852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8880381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8880859Z module_map=module_map) 2025-05-07T20:32:05.8881239Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8881678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8881941Z E ^ 2025-05-07T20:32:05.8882409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8882857Z 2025-05-07T20:32:05.8883280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8883793Z 2025-05-07T20:32:05.8883904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8884320Z self=, 2025-05-07T20:32:05.8884731Z T=1, 2025-05-07T20:32:05.8884924Z D=5120, 2025-05-07T20:32:05.8885115Z scale_ub=None, 2025-05-07T20:32:05.8885335Z contiguous=True, 2025-05-07T20:32:05.8885568Z compiled=True, 2025-05-07T20:32:05.8885772Z ) 2025-05-07T20:32:06.4006845Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.4008998Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:06.4010392Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.4011817Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.4013462Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.4014860Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.4016160Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.4017540Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.4018954Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.4020211Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:06.4021623Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.4022846Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:06.4023892Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.4024906Z W0507 20:32:06.396558 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:06.4026124Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.4027565Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.4028692Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.4029724Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:06.4030909Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.4032275Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.4033353Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.4034267Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.4035018Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:06.4036050Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.5887288Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.5888557Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:06.5889966Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.5891408Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.5892798Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.5894197Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.5895513Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.5896907Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.5898318Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.5899571Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:06.5900931Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.5902240Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:06.5903281Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.5904314Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:06.5905544Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.5906849Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.5907980Z W0507 
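[editor's note] Every failure above has the same root cause. Triton maps torch.float8_e4m3fn to its fp8e4nv type, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward; this job runs on a g5.4xlarge, whose A10G reports capability (8, 6), so TTIR generation raises the ValueError naming fp8e4b15 and fp8e5 as the only available fp8 dtypes. A minimal sketch of a capability guard follows; supports_fp8e4nv and Fp8KernelTest are illustrative names, not part of the FBGEMM test suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # the A10G in a g5.4xlarge reports (8, 6), hence the errors above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8KernelTest(unittest.TestCase):
        ...

With a guard like this the whole class would be reported as skipped on SM 8.6 runners instead of failing in the Triton compiler on every Hypothesis example.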
20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.5909037Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:06.5910219Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.5911666Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.5912739Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.5913667Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.5914415Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:06.5915429Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0872821Z self = 2025-05-07T20:32:07.0873417Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.0873815Z 2025-05-07T20:32:07.0873951Z @given( 2025-05-07T20:32:07.0874281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.0874733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.0875201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.0875550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.0875888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.0876186Z ) 2025-05-07T20:32:07.0876552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.0877005Z def test_silu_mul_quant( 2025-05-07T20:32:07.0877260Z self, 2025-05-07T20:32:07.0877465Z T: int, 2025-05-07T20:32:07.0877671Z D: int, 2025-05-07T20:32:07.0877893Z scale_ub: Optional[float], 2025-05-07T20:32:07.0878175Z contiguous: bool, 2025-05-07T20:32:07.0878426Z compiled: bool, 2025-05-07T20:32:07.0878655Z ) -> None: 2025-05-07T20:32:07.0878891Z torch.manual_seed(2025) 2025-05-07T20:32:07.0879335Z 2025-05-07T20:32:07.0879614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.0879973Z 2025-05-07T20:32:07.0880179Z x_sign = torch.sign(x) 2025-05-07T20:32:07.0880478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.0880803Z x = x_sign * x_clamp 2025-05-07T20:32:07.0881061Z x0 = x[:, :D] 2025-05-07T20:32:07.0881285Z x1 = x[:, D:] 2025-05-07T20:32:07.0881501Z 2025-05-07T20:32:07.0881699Z if contiguous: 2025-05-07T20:32:07.0881937Z x0 = x0.contiguous() 2025-05-07T20:32:07.0882210Z x1 = x1.contiguous() 2025-05-07T20:32:07.0882461Z 2025-05-07T20:32:07.0882654Z if scale_ub is not None: 2025-05-07T20:32:07.0882945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.0883295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.0883618Z ) 2025-05-07T20:32:07.0883831Z else: 2025-05-07T20:32:07.0884053Z scale_ub_tensor = None 
2025-05-07T20:32:07.0884315Z 2025-05-07T20:32:07.0884556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0884888Z op = silu_mul_quant 2025-05-07T20:32:07.0885155Z if compiled: 2025-05-07T20:32:07.0885410Z op = torch.compile(op) 2025-05-07T20:32:07.0885722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.0886011Z 2025-05-07T20:32:07.0886209Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.0886513Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.0886817Z 2025-05-07T20:32:07.0887065Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0887415Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.0887743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.0888076Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.0888583Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0888909Z 2025-05-07T20:32:07.0889124Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.0889325Z 2025-05-07T20:32:07.0889438Z moe/activation_test.py:126: 2025-05-07T20:32:07.0889749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0890099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.0890439Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0891233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.0891994Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.0892554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.0893248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.0893961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.0894691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0895453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.0896198Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0896939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.0897588Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.0898199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.0898718Z fn() 2025-05-07T20:32:07.0899238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.0899917Z self.fn.run( 2025-05-07T20:32:07.0900392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.0900932Z kernel = self.compile( 2025-05-07T20:32:07.0901553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.0902214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.0902619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0902860Z 2025-05-07T20:32:07.0903074Z self = 2025-05-07T20:32:07.0904181Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.0905569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6940>} 2025-05-07T20:32:07.0906915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.0907930Z context = 2025-05-07T20:32:07.0908230Z 2025-05-07T20:32:07.0908402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.0908940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0909414Z module_map=module_map) 2025-05-07T20:32:07.0909791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0910674Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.0910964Z E ^ 2025-05-07T20:32:07.0911437Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0911897Z 2025-05-07T20:32:07.0912316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.0912834Z 2025-05-07T20:32:07.0912942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.0913368Z self=, 2025-05-07T20:32:07.0913775Z T=2048, 2025-05-07T20:32:07.0913978Z D=5120, 2025-05-07T20:32:07.0914183Z scale_ub=None, 2025-05-07T20:32:07.0914403Z contiguous=True, 2025-05-07T20:32:07.0914639Z compiled=True, 2025-05-07T20:32:07.0914861Z ) 2025-05-07T20:32:07.5584892Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.5586494Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.5588193Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.5589645Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.5591019Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.5592628Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.5593941Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.5595312Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.5596733Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.5597984Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.5599212Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.5600486Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.5601536Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.5602562Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.5603897Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.5605521Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.5606922Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.5608218Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.5609699Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.5611404Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.5612734Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.5613864Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.5614784Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.5616066Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7459606Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.7461362Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.7462712Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.7464132Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.7465538Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.7466935Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7468267Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.7469665Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7471095Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.7472355Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.7473757Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.7474988Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.7476041Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.7477072Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.7478317Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.7479618Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.7480743Z W0507 
20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.7481799Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.7482983Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.7484354Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.7485513Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7486426Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.7487168Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.7488213Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2433764Z self = 2025-05-07T20:32:08.2434580Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.2434964Z 2025-05-07T20:32:08.2435083Z @given( 2025-05-07T20:32:08.2435431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.2435897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.2436286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.2436625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.2436958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.2437253Z ) 2025-05-07T20:32:08.2437615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.2438061Z def test_silu_mul_quant( 2025-05-07T20:32:08.2438314Z self, 2025-05-07T20:32:08.2438517Z T: int, 2025-05-07T20:32:08.2438718Z D: int, 2025-05-07T20:32:08.2438945Z scale_ub: Optional[float], 2025-05-07T20:32:08.2439228Z contiguous: bool, 2025-05-07T20:32:08.2439471Z compiled: bool, 2025-05-07T20:32:08.2439710Z ) -> None: 2025-05-07T20:32:08.2439939Z torch.manual_seed(2025) 2025-05-07T20:32:08.2440542Z 2025-05-07T20:32:08.2440836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.2441192Z 2025-05-07T20:32:08.2441392Z x_sign = torch.sign(x) 2025-05-07T20:32:08.2441697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.2442021Z x = x_sign * x_clamp 2025-05-07T20:32:08.2442274Z x0 = x[:, :D] 2025-05-07T20:32:08.2442494Z x1 = x[:, D:] 2025-05-07T20:32:08.2442712Z 2025-05-07T20:32:08.2442913Z if contiguous: 2025-05-07T20:32:08.2443148Z x0 = x0.contiguous() 2025-05-07T20:32:08.2443418Z x1 = x1.contiguous() 2025-05-07T20:32:08.2443671Z 2025-05-07T20:32:08.2443869Z if scale_ub is not None: 2025-05-07T20:32:08.2444153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.2444498Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.2444812Z ) 2025-05-07T20:32:08.2445014Z else: 2025-05-07T20:32:08.2445246Z scale_ub_tensor = None 
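[editor's note] For reference, the test's ref_fn computes the SiLU-mul in fp32 (y = x0 * sigmoid(x0) * x1) and then quantizes row-wise to FP8. Below is a hedged pure-PyTorch sketch of that rowwise quantization under the semantics the test implies: per-row scale = max(|row|) / FP8_MAX, optionally capped by scale_ub, with y_scale being the dequantization scale used as y_fp8.to(torch.float32) * y_scale[:, None]. The function name quantize_fp8_row_reference is illustrative, and the real triton_quantize_fp8_row kernel may differ in details such as zero handling or scale dtype:

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def quantize_fp8_row_reference(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequantization scale: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale

Running a CPU/PyTorch reference like this would sidestep Triton entirely, which is why the failures only appear once the test reaches the Triton-backed paths.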
2025-05-07T20:32:08.2445501Z 2025-05-07T20:32:08.2445739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.2446064Z op = silu_mul_quant 2025-05-07T20:32:08.2446327Z if compiled: 2025-05-07T20:32:08.2446585Z op = torch.compile(op) 2025-05-07T20:32:08.2446890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.2447175Z 2025-05-07T20:32:08.2447368Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.2455212Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.2455561Z 2025-05-07T20:32:08.2455817Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.2456177Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.2456493Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.2456824Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.2457205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.2457698Z 2025-05-07T20:32:08.2457915Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.2458122Z 2025-05-07T20:32:08.2458231Z moe/activation_test.py:126: 2025-05-07T20:32:08.2458555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2458917Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.2459256Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.2460065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.2460826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.2461492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.2462183Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.2462896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.2463637Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.2464406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.2465151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.2465897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.2466553Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.2467165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.2467693Z fn() 2025-05-07T20:32:08.2468291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.2468889Z self.fn.run( 2025-05-07T20:32:08.2469366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.2469930Z kernel = self.compile( 2025-05-07T20:32:08.2470518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.2471183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.2471591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2471835Z 2025-05-07T20:32:08.2472051Z self = 2025-05-07T20:32:08.2473133Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.2474534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb6519d0>} 2025-05-07T20:32:08.2475882Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.2476916Z context = 2025-05-07T20:32:08.2477217Z 2025-05-07T20:32:08.2477392Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.2477931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.2478398Z module_map=module_map) 2025-05-07T20:32:08.2478784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.2479161Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.2479526Z E ^ 2025-05-07T20:32:08.2480042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2480503Z 2025-05-07T20:32:08.2480921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.2481431Z 2025-05-07T20:32:08.2481544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.2481962Z self=, 2025-05-07T20:32:08.2482378Z T=128, 2025-05-07T20:32:08.2482582Z D=5120, 2025-05-07T20:32:08.2482785Z scale_ub=None, 2025-05-07T20:32:08.2483010Z contiguous=True, 2025-05-07T20:32:08.2483253Z compiled=True, 2025-05-07T20:32:08.2483471Z ) 2025-05-07T20:32:08.7721574Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.7722970Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.7724319Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.7725764Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.7727159Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.7728729Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7730065Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.7731469Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7732885Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.7734141Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.7735389Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.7736612Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.7737657Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.7738686Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.7739935Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.7741719Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.7742857Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.7743913Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.7745112Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.7746474Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.7747547Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7748480Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7749238Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.7750325Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9605960Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9607424Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.9608779Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9610256Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9611646Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9613036Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9614331Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9615710Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9617124Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9618364Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.9619706Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9620928Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.9622053Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.9623084Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.9624303Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9625606Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9626711Z W0507 
20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.9627769Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.9628943Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9630355Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9631546Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9632465Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9633219Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.9634246Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7796067Z self = 2025-05-07T20:32:09.7796836Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.7797194Z 2025-05-07T20:32:09.7797305Z @given( 2025-05-07T20:32:09.7797655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.7798001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.7798316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.7798667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.7799017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.7799316Z ) 2025-05-07T20:32:09.7799686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.7800150Z def test_silu_mul_quant( 2025-05-07T20:32:09.7800414Z self, 2025-05-07T20:32:09.7800617Z T: int, 2025-05-07T20:32:09.7800834Z D: int, 2025-05-07T20:32:09.7801070Z scale_ub: Optional[float], 2025-05-07T20:32:09.7801351Z contiguous: bool, 2025-05-07T20:32:09.7801607Z compiled: bool, 2025-05-07T20:32:09.7801843Z ) -> None: 2025-05-07T20:32:09.7802063Z torch.manual_seed(2025) 2025-05-07T20:32:09.7802313Z 2025-05-07T20:32:09.7802785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.7803131Z 2025-05-07T20:32:09.7803333Z x_sign = torch.sign(x) 2025-05-07T20:32:09.7803633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.7803952Z x = x_sign * x_clamp 2025-05-07T20:32:09.7804206Z x0 = x[:, :D] 2025-05-07T20:32:09.7804434Z x1 = x[:, D:] 2025-05-07T20:32:09.7804646Z 2025-05-07T20:32:09.7804843Z if contiguous: 2025-05-07T20:32:09.7805091Z x0 = x0.contiguous() 2025-05-07T20:32:09.7805354Z x1 = x1.contiguous() 2025-05-07T20:32:09.7805606Z 2025-05-07T20:32:09.7805806Z if scale_ub is not None: 2025-05-07T20:32:09.7806087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.7806424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.7806744Z ) 2025-05-07T20:32:09.7806943Z else: 2025-05-07T20:32:09.7807165Z scale_ub_tensor = None 
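[editor's note] The repeated W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" blocks are a downstream symptom, not a separate bug: when torch.compile traces a user-defined Triton kernel it regenerates the kernel's TTIR (generate_ttir -> src.make_ir) to work out which arguments the kernel mutates, and that regeneration hits the same fp8e4nv CompilationError, so Dynamo falls back to treating all inputs as mutated and logs the traceback at warning level. A hedged sketch of an architecture-aware dtype choice follows; pick_triton_fp8_dtype is an illustrative helper, not FBGEMM API:

    import torch

    def pick_triton_fp8_dtype() -> torch.dtype:
        # fp8e4nv <-> torch.float8_e4m3fn needs SM 8.9+; fp8e5 <->
        # torch.float8_e5m2 is one of the two types the ValueError in this
        # log lists as supported on the A10G.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Whether e5m2's reduced mantissa precision is acceptable for these kernels is a separate accuracy question; the sketch only shows how the hard compile failure could be avoided on pre-Ada parts.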
2025-05-07T20:32:09.7807424Z 2025-05-07T20:32:09.7807665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7807983Z op = silu_mul_quant 2025-05-07T20:32:09.7808241Z if compiled: 2025-05-07T20:32:09.7808499Z op = torch.compile(op) 2025-05-07T20:32:09.7808797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7809084Z 2025-05-07T20:32:09.7809286Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.7809578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.7809878Z 2025-05-07T20:32:09.7810126Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7810466Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.7810769Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.7811098Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.7811469Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.7811969Z 2025-05-07T20:32:09.7812189Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.7812389Z 2025-05-07T20:32:09.7812501Z moe/activation_test.py:126: 2025-05-07T20:32:09.7812801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7813147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.7813485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.7814283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.7815035Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.7815590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7816278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7816971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.7817709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.7818468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.7819214Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.7819942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.7820584Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.7821274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.7821799Z fn() 2025-05-07T20:32:09.7822311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.7823003Z self.fn.run( 2025-05-07T20:32:09.7823477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7824008Z kernel = self.compile( 2025-05-07T20:32:09.7824553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7825215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7825624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7825859Z 2025-05-07T20:32:09.7826070Z self = 2025-05-07T20:32:09.7827164Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7828549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb9bb550>} 2025-05-07T20:32:09.7829905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7830928Z context = 2025-05-07T20:32:09.7831223Z 2025-05-07T20:32:09.7831395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7831927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7832403Z module_map=module_map) 2025-05-07T20:32:09.7832774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7833144Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.7833509Z E ^ 2025-05-07T20:32:09.7834054Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7834615Z 2025-05-07T20:32:09.7835121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.7835754Z 2025-05-07T20:32:09.7835865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.7836347Z self=, 2025-05-07T20:32:09.7836814Z T=4096, 2025-05-07T20:32:09.7837019Z D=5120, 2025-05-07T20:32:09.7837234Z scale_ub=None, 2025-05-07T20:32:09.7837463Z contiguous=True, 2025-05-07T20:32:09.7837709Z compiled=True, 2025-05-07T20:32:09.7837936Z ) 2025-05-07T20:32:10.3095927Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.3098122Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:10.3100811Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.3102409Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.3103814Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.3105230Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3106728Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.3108134Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3109577Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.3110832Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:10.3112062Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.3113277Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:10.3114339Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:10.3115369Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:10.3116704Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.3118004Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.3119139Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.3120209Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:10.3121467Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.3122835Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.3123925Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3124867Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3125627Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:10.3126670Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.1449984Z self = 
2025-05-07T20:32:11.1450637Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.1466664Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.1466980Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.1487395Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.1487762Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.1488043Z E ^
2025-05-07T20:32:11.1488604Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.1489482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.1490108Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1490535Z self=,
2025-05-07T20:32:11.1490949Z T=16384,
2025-05-07T20:32:11.1491149Z D=5120,
2025-05-07T20:32:11.1491354Z scale_ub=None,
2025-05-07T20:32:11.1491576Z contiguous=True,
2025-05-07T20:32:11.1491804Z compiled=True,
2025-05-07T20:32:11.1492021Z )
2025-05-07T20:32:11.1916439Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.1917715Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.1919072Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.1920058Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.1921163Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
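Note on the repeated triton_kernel_wrap.py warnings above: when torch.compile traces a user-defined Triton kernel, it first lowers the kernel to TTIR (generate_ttir) to discover which pointer arguments are actually written. Here that lowering itself raises on the unsupported fp8e4nv dtype, so identify_mutated_tensors logs the exception and falls back to the conservative answer, "every input is mutated". The warnings are thus a side effect of the real problem, which resurfaces as the hard CompilationError at launch time. A minimal sketch of the pattern being traced, assuming a CUDA machine with triton installed; the names _copy_kernel and copy are hypothetical, not FBGEMM API:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program copies one BLOCK-sized tile; the tl.store below is the
    # write that identify_mutated_tensors tries to recover from the TTIR.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    tl.store(y_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)

def copy(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=1024)
    return y

# torch.compile routes the raw kernel launch through
# torch._higher_order_ops.triton_kernel_wrap, the module that emits the
# "assuming every input is mutated" warnings in this log.
copy_compiled = torch.compile(copy)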
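The convert_frame.py warning above is a separate, more benign issue: each hypothesis example hands silu_mul_quant tensors with fresh shapes or strides (the quoted 'x0' stride mismatch, 5120 vs 10240, is the contiguous=True vs False split), Dynamo installs guards per variant, and after eight recompiles it hits config.recompile_limit and falls back to eager for that frame. A sketch of the usual mitigations, assuming the torch._dynamo knob named in the warning itself (recompile_limit) plus mark_dynamic; exact defaults vary by PyTorch version:

import torch
import torch._dynamo as dynamo

# Raise the per-frame recompile budget (the warning shows the default of 8).
dynamo.config.recompile_limit = 64

x0 = torch.randn(128, 5120, dtype=torch.bfloat16)
# Or mark the token dimension as dynamic so one compiled graph serves
# T in {1, 128, 2048, 4096, 16384} instead of one graph per size.
dynamo.mark_dynamic(x0, 0)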
2025-05-07T20:32:11.3137356Z self = 
2025-05-07T20:32:11.3138061Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.3153874Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.3154188Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.3174725Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3175101Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.3175375Z E ^
2025-05-07T20:32:11.3175841Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.3176288Z 2025-05-07T20:32:11.3176717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.3177229Z 2025-05-07T20:32:11.3177335Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.3177760Z self=, 2025-05-07T20:32:11.3178170Z T=1, 2025-05-07T20:32:11.3178357Z D=5120, 2025-05-07T20:32:11.3178560Z scale_ub=1200.0, 2025-05-07T20:32:11.3178797Z contiguous=True, 2025-05-07T20:32:11.3179024Z compiled=True, 2025-05-07T20:32:11.3179241Z ) 2025-05-07T20:32:11.4885287Z self = 2025-05-07T20:32:11.4886945Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.4887487Z 2025-05-07T20:32:11.4887661Z @given( 2025-05-07T20:32:11.4888130Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.4888779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.4889414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.4890151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.4890802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.4891179Z ) 2025-05-07T20:32:11.4891581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.4892109Z def test_silu_mul_quant( 2025-05-07T20:32:11.4892392Z self, 2025-05-07T20:32:11.4892601Z T: int, 2025-05-07T20:32:11.4892813Z D: int, 2025-05-07T20:32:11.4893052Z scale_ub: Optional[float], 2025-05-07T20:32:11.4893353Z contiguous: bool, 2025-05-07T20:32:11.4893630Z compiled: bool, 2025-05-07T20:32:11.4893880Z ) -> None: 2025-05-07T20:32:11.4894111Z torch.manual_seed(2025) 2025-05-07T20:32:11.4894380Z 2025-05-07T20:32:11.4894682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.4895070Z 2025-05-07T20:32:11.4895275Z x_sign = torch.sign(x) 2025-05-07T20:32:11.4895601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.4895947Z x = x_sign * x_clamp 2025-05-07T20:32:11.4896218Z x0 = x[:, :D] 2025-05-07T20:32:11.4896453Z x1 = x[:, D:] 2025-05-07T20:32:11.4896674Z 2025-05-07T20:32:11.4896874Z if contiguous: 2025-05-07T20:32:11.4897131Z x0 = x0.contiguous() 2025-05-07T20:32:11.4897412Z x1 = x1.contiguous() 2025-05-07T20:32:11.4897681Z 2025-05-07T20:32:11.4897891Z if scale_ub is not None: 2025-05-07T20:32:11.4898192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.4898703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.4899030Z ) 2025-05-07T20:32:11.4899233Z else: 2025-05-07T20:32:11.4899447Z scale_ub_tensor = None 2025-05-07T20:32:11.4899712Z 2025-05-07T20:32:11.4899953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.4900270Z op = silu_mul_quant 2025-05-07T20:32:11.4900535Z if compiled: 2025-05-07T20:32:11.4900797Z op = torch.compile(op) 2025-05-07T20:32:11.4901225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.4901515Z 2025-05-07T20:32:11.4901723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.4901892Z 2025-05-07T20:32:11.4901999Z moe/activation_test.py:117: 2025-05-07T20:32:11.4902303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.4902652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.4902944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.4903517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.4904083Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.4904745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.4905430Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.4905972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.4906656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.4907323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.4907855Z kernel = self.compile( 2025-05-07T20:32:11.4908409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.4909159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.4909568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.4909807Z 2025-05-07T20:32:11.4910020Z self = 2025-05-07T20:32:11.4911127Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.4912510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb831e50>} 2025-05-07T20:32:11.4913856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.4914872Z context = 2025-05-07T20:32:11.4915172Z 2025-05-07T20:32:11.4915343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.4915873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.4916345Z module_map=module_map) 2025-05-07T20:32:11.4916718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.4917075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.4917343Z E ^ 2025-05-07T20:32:11.4917805Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4918676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.4919421Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.4919842Z self=,
2025-05-07T20:32:11.4920253Z T=1,
2025-05-07T20:32:11.4920445Z D=5120,
2025-05-07T20:32:11.4920647Z scale_ub=None,
2025-05-07T20:32:11.4920868Z contiguous=False,
2025-05-07T20:32:11.4921106Z compiled=True,
2025-05-07T20:32:11.4921324Z )
2025-05-07T20:32:11.5732324Z self = 
2025-05-07T20:32:11.5732898Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.5748635Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.5748950Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.5769358Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.5769726Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.5770001Z E ^
2025-05-07T20:32:11.5770463Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.5771353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.5772071Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.5772494Z self=,
2025-05-07T20:32:11.5772901Z T=1,
2025-05-07T20:32:11.5773099Z D=5120,
2025-05-07T20:32:11.5773302Z scale_ub=None,
2025-05-07T20:32:11.5773527Z contiguous=True,
2025-05-07T20:32:11.5773762Z compiled=False,
2025-05-07T20:32:11.5773983Z )
2025-05-07T20:32:11.9644446Z self = 
2025-05-07T20:32:11.9645225Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[... test source identical to the T = 128 example above, through: ...]
2025-05-07T20:32:11.9656403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:11.9656741Z op = silu_mul_quant
2025-05-07T20:32:11.9657021Z if compiled:
2025-05-07T20:32:11.9657273Z op 
= torch.compile(op) 2025-05-07T20:32:11.9657582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9657867Z 2025-05-07T20:32:11.9658062Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9658244Z 2025-05-07T20:32:11.9658349Z moe/activation_test.py:117: 2025-05-07T20:32:11.9658659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9659008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9659295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9659995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9660696Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9661335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9662217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9662893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9663438Z kernel = self.compile( 2025-05-07T20:32:11.9663984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9664655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9665066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9665302Z 2025-05-07T20:32:11.9665521Z self = 2025-05-07T20:32:11.9666599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9668007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ead64dc0>} 2025-05-07T20:32:11.9669367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9670395Z context = 2025-05-07T20:32:11.9670688Z 2025-05-07T20:32:11.9670860Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9671392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9671877Z module_map=module_map) 2025-05-07T20:32:11.9672260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9672618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9673002Z E ^ 2025-05-07T20:32:11.9673474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.9674345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9674986Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9675415Z self=,
2025-05-07T20:32:11.9675837Z T=128,
2025-05-07T20:32:11.9676031Z D=5120,
2025-05-07T20:32:11.9676235Z scale_ub=None,
2025-05-07T20:32:11.9676468Z contiguous=False,
2025-05-07T20:32:11.9676700Z compiled=True,
2025-05-07T20:32:11.9676915Z )
2025-05-07T20:32:11.9677246Z self = 
2025-05-07T20:32:11.9677779Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.9698356Z > y_fp8, y_scale = fn()
2025-05-07T20:32:11.9698639Z moe/activation_test.py:117: 
[... _fbgemm_silu_mul_quant compile traceback via torch/_dynamo/eval_frame.py, identical to the T = 1, scale_ub = 1200.0 example above ...]
2025-05-07T20:32:11.9713746Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9714111Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9714386Z E ^
2025-05-07T20:32:11.9714864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9715315Z 2025-05-07T20:32:11.9715731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9716260Z 2025-05-07T20:32:11.9716368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.9716810Z self=, 2025-05-07T20:32:11.9717221Z T=128, 2025-05-07T20:32:11.9717415Z D=7168, 2025-05-07T20:32:11.9717630Z scale_ub=1200.0, 2025-05-07T20:32:11.9717868Z contiguous=False, 2025-05-07T20:32:11.9718100Z compiled=False, 2025-05-07T20:32:11.9718321Z ) 2025-05-07T20:32:12.1251287Z self = 2025-05-07T20:32:12.1252069Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.1252461Z 2025-05-07T20:32:12.1252586Z @given( 2025-05-07T20:32:12.1252847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1253184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1253511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1253859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1254203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1254876Z ) 2025-05-07T20:32:12.1255250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1255711Z def test_silu_mul_quant( 2025-05-07T20:32:12.1255966Z self, 2025-05-07T20:32:12.1256174Z T: int, 2025-05-07T20:32:12.1256377Z D: int, 2025-05-07T20:32:12.1256611Z scale_ub: Optional[float], 2025-05-07T20:32:12.1256898Z contiguous: bool, 2025-05-07T20:32:12.1257147Z compiled: bool, 2025-05-07T20:32:12.1257392Z ) -> None: 2025-05-07T20:32:12.1257622Z torch.manual_seed(2025) 2025-05-07T20:32:12.1257875Z 2025-05-07T20:32:12.1258164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1258530Z 2025-05-07T20:32:12.1258731Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1259041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1259374Z x = x_sign * x_clamp 2025-05-07T20:32:12.1259646Z x0 = x[:, :D] 2025-05-07T20:32:12.1259881Z x1 = x[:, D:] 2025-05-07T20:32:12.1260103Z 2025-05-07T20:32:12.1260299Z if contiguous: 2025-05-07T20:32:12.1260548Z x0 = x0.contiguous() 2025-05-07T20:32:12.1260825Z x1 = x1.contiguous() 2025-05-07T20:32:12.1261198Z 2025-05-07T20:32:12.1261405Z if scale_ub is not None: 2025-05-07T20:32:12.1261696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1262052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1262372Z ) 2025-05-07T20:32:12.1262577Z else: 2025-05-07T20:32:12.1262793Z scale_ub_tensor = None 2025-05-07T20:32:12.1263056Z 2025-05-07T20:32:12.1263300Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1263622Z op = silu_mul_quant 2025-05-07T20:32:12.1263885Z if compiled: 2025-05-07T20:32:12.1264146Z op = torch.compile(op) 2025-05-07T20:32:12.1264620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1264914Z 2025-05-07T20:32:12.1265122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1265292Z 2025-05-07T20:32:12.1265408Z moe/activation_test.py:117: 2025-05-07T20:32:12.1265713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1266060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1266360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1267063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1267761Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1268312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1269001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1269681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1270242Z kernel = self.compile( 2025-05-07T20:32:12.1270793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1271452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1271864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1272107Z 2025-05-07T20:32:12.1272321Z self = 2025-05-07T20:32:12.1273407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1274920Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea920430>} 2025-05-07T20:32:12.1276284Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1277327Z context = 2025-05-07T20:32:12.1277620Z 2025-05-07T20:32:12.1277799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1278332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1278802Z module_map=module_map) 2025-05-07T20:32:12.1279181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1279546Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1279807Z E ^ 2025-05-07T20:32:12.1280289Z E ValueError("type fp8e4nv not supported in this architecture. 
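Every failure in this run has the same root cause: the Triton kernel requests the fp8e4nv element type (Triton's name for the float8_e4m3fn format), and the NVIDIA backend rejects it at IR-generation time because the GPU in this job only exposes the 'fp8e4b15' and 'fp8e5' formats. fp8e4nv generally requires compute capability 8.9 (Ada) or newer, depending on the Triton version; an Ampere-class part such as the A10G reports 8.6 and is rejected. A minimal probe, sketched here in plain PyTorch (the helper name is ours, not an FBGEMM or Triton API, and the (8, 9) threshold is an assumption about the Triton backend):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's NVIDIA backend accepts fp8e4nv from
        # compute capability (8, 9) upward; older GPUs raise the
        # ValueError seen in the tracebacks here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

On a GPU that reports (8, 6), supports_fp8e4nv() returns False, which is exactly the condition under which every example below fails.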
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [source listing and traceback identical to the one above; same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [identical failure]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    [identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    [identical failure via the compiled path]
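Every drawn combination of (T, D, scale_ub, contiguous, compiled) dies at Triton compile time, before any data-dependent work runs, so the failure is a property of the hardware rather than of the inputs Hypothesis generates. The usual remedy is to skip the test on GPUs without fp8e4nv support instead of letting each example error out. A sketch, reusing the probe above (the decorator placement and class name are assumptions for illustration, not the test's current code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Same assumed capability threshold as the probe sketched earlier.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks Triton fp8e4nv support")
        def test_silu_mul_quant(self) -> None:
            ...  # the @given-decorated body from the listing above goes here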
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [same @given/@settings decorators and test body as the listing above, continuing past fn():]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
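This is the one example in the batch that fails later: the silu_mul_quant call itself returned, and the error instead comes from the reference path, because triton_quantize_fp8_row also compiles a Triton kernel (_kernel_quantize_fp8_row) that targets fp8e4nv. In other words, on this GPU even the "reference" side cannot run. A device-neutral reference in plain PyTorch would sidestep that; a rough sketch follows (our own helper, not the FBGEMM API; assumes a PyTorch build with torch.float8_e4m3fn, i.e. 2.1 or newer, and FBGEMM's exact clamping and epsilon details may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise dynamic quantization: one scale per row, chosen so that
        # x is approximately x_fp8.to(torch.float32) * scale[:, None],
        # matching how the test dequantizes y_fp8 above.
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale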
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    [identical failure in _fbgemm_silu_mul_quant via the compiled path]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0451166Z 2025-05-07T20:32:13.0451594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0452165Z 2025-05-07T20:32:13.0452271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0452696Z self=, 2025-05-07T20:32:13.0453097Z T=1, 2025-05-07T20:32:13.0453291Z D=5120, 2025-05-07T20:32:13.0453497Z scale_ub=1200.0, 2025-05-07T20:32:13.0453726Z contiguous=False, 2025-05-07T20:32:13.0453965Z compiled=False, 2025-05-07T20:32:13.0454250Z ) 2025-05-07T20:32:13.0454574Z self = 2025-05-07T20:32:13.0455078Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.0455357Z 2025-05-07T20:32:13.0455439Z @given( 2025-05-07T20:32:13.0455678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0456000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0456319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0456660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0456997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0457295Z ) 2025-05-07T20:32:13.0457653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0458097Z def test_silu_mul_quant( 2025-05-07T20:32:13.0458349Z self, 2025-05-07T20:32:13.0458550Z T: int, 2025-05-07T20:32:13.0458749Z D: int, 2025-05-07T20:32:13.0458984Z scale_ub: Optional[float], 2025-05-07T20:32:13.0459265Z contiguous: bool, 2025-05-07T20:32:13.0459508Z compiled: bool, 2025-05-07T20:32:13.0459740Z ) -> None: 2025-05-07T20:32:13.0459964Z torch.manual_seed(2025) 2025-05-07T20:32:13.0460213Z 2025-05-07T20:32:13.0460489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0460841Z 2025-05-07T20:32:13.0461043Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0461401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0461722Z x = x_sign * x_clamp 2025-05-07T20:32:13.0461972Z x0 = x[:, :D] 2025-05-07T20:32:13.0462191Z x1 = x[:, D:] 2025-05-07T20:32:13.0462405Z 2025-05-07T20:32:13.0462596Z if contiguous: 2025-05-07T20:32:13.0462828Z x0 = x0.contiguous() 2025-05-07T20:32:13.0463094Z x1 = x1.contiguous() 2025-05-07T20:32:13.0463343Z 2025-05-07T20:32:13.0463627Z if scale_ub is not None: 2025-05-07T20:32:13.0463912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0464462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0464776Z ) 2025-05-07T20:32:13.0464979Z else: 2025-05-07T20:32:13.0465199Z scale_ub_tensor = None 2025-05-07T20:32:13.0465461Z 2025-05-07T20:32:13.0465695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0466019Z op = silu_mul_quant 2025-05-07T20:32:13.0466282Z if compiled: 2025-05-07T20:32:13.0466531Z op = torch.compile(op) 2025-05-07T20:32:13.0466841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0467127Z 2025-05-07T20:32:13.0467322Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0467499Z 2025-05-07T20:32:13.0467601Z moe/activation_test.py:117: 2025-05-07T20:32:13.0467908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0468251Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0468542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0469236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0469937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0470550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0471242Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0471915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0472611Z kernel = self.compile( 2025-05-07T20:32:13.0473159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0473826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0474286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0474518Z 2025-05-07T20:32:13.0474729Z self = 2025-05-07T20:32:13.0475816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0477190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9e373a0>} 2025-05-07T20:32:13.0478536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0479562Z context = 2025-05-07T20:32:13.0479856Z 2025-05-07T20:32:13.0480027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0480560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0481035Z module_map=module_map) 2025-05-07T20:32:13.0481411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0481791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0482060Z E ^ 2025-05-07T20:32:13.0482523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0482980Z 2025-05-07T20:32:13.0483404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0492117Z 2025-05-07T20:32:13.0492256Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0492815Z self=, 2025-05-07T20:32:13.0493228Z T=16384, 2025-05-07T20:32:13.0493444Z D=5120, 2025-05-07T20:32:13.0493655Z scale_ub=1200.0, 2025-05-07T20:32:13.0493889Z contiguous=False, 2025-05-07T20:32:13.0494129Z compiled=True, 2025-05-07T20:32:13.0494354Z ) 2025-05-07T20:32:13.1664619Z self = 2025-05-07T20:32:13.1665150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.1665445Z 2025-05-07T20:32:13.1665529Z @given( 2025-05-07T20:32:13.1665778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.1666109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.1666428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.1666776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.1667126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.1667456Z ) 2025-05-07T20:32:13.1667820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.1668278Z def test_silu_mul_quant( 2025-05-07T20:32:13.1668523Z self, 2025-05-07T20:32:13.1668732Z T: int, 2025-05-07T20:32:13.1668945Z D: int, 2025-05-07T20:32:13.1669443Z scale_ub: Optional[float], 2025-05-07T20:32:13.1669737Z contiguous: bool, 2025-05-07T20:32:13.1669992Z compiled: bool, 2025-05-07T20:32:13.1670226Z ) -> None: 2025-05-07T20:32:13.1670455Z torch.manual_seed(2025) 2025-05-07T20:32:13.1670714Z 2025-05-07T20:32:13.1670991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.1671351Z 2025-05-07T20:32:13.1671554Z x_sign = torch.sign(x) 2025-05-07T20:32:13.1671853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.1672165Z x = x_sign * x_clamp 2025-05-07T20:32:13.1672520Z x0 = x[:, :D] 2025-05-07T20:32:13.1672749Z x1 = x[:, D:] 2025-05-07T20:32:13.1672960Z 2025-05-07T20:32:13.1673161Z if contiguous: 2025-05-07T20:32:13.1673406Z x0 = x0.contiguous() 2025-05-07T20:32:13.1673673Z x1 = x1.contiguous() 2025-05-07T20:32:13.1673929Z 2025-05-07T20:32:13.1674139Z if scale_ub is not None: 2025-05-07T20:32:13.1674418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.1674771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.1675097Z ) 2025-05-07T20:32:13.1675296Z else: 2025-05-07T20:32:13.1675521Z scale_ub_tensor = None 2025-05-07T20:32:13.1675790Z 2025-05-07T20:32:13.1676028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.1676361Z op = silu_mul_quant 2025-05-07T20:32:13.1676631Z if compiled: 2025-05-07T20:32:13.1676895Z op = torch.compile(op) 2025-05-07T20:32:13.1677208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1677503Z 2025-05-07T20:32:13.1677711Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.1677885Z 2025-05-07T20:32:13.1678024Z moe/activation_test.py:117: 2025-05-07T20:32:13.1678327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1678676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.1678975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1679544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.1680123Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.1680798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.1681491Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.1682030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.1682887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.1683564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.1684100Z kernel = self.compile( 2025-05-07T20:32:13.1684665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.1685342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.1685750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1685984Z 2025-05-07T20:32:13.1686197Z self = 2025-05-07T20:32:13.1687285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.1688672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45b0d0>} 2025-05-07T20:32:13.1690042Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.1691130Z context = 2025-05-07T20:32:13.1691425Z 2025-05-07T20:32:13.1691597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.1692132Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.1692604Z module_map=module_map) 2025-05-07T20:32:13.1692975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.1693392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.1693664Z E ^ 2025-05-07T20:32:13.1694133Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.1694592Z 2025-05-07T20:32:13.1695010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.1695534Z 2025-05-07T20:32:13.1695639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.1696059Z self=, 2025-05-07T20:32:13.1696470Z T=2048, 2025-05-07T20:32:13.1696664Z D=7168, 2025-05-07T20:32:13.1696868Z scale_ub=1200.0, 2025-05-07T20:32:13.1697100Z contiguous=False, 2025-05-07T20:32:13.1697330Z compiled=True, 2025-05-07T20:32:13.1697551Z ) 2025-05-07T20:32:13.1697880Z self = 2025-05-07T20:32:13.1698391Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.1698676Z 2025-05-07T20:32:13.1698759Z @given( 2025-05-07T20:32:13.1699003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.1699322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.1699649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.1699990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.1700330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.1700625Z ) 2025-05-07T20:32:13.1700986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.1701548Z def test_silu_mul_quant( 2025-05-07T20:32:13.1701795Z self, 2025-05-07T20:32:13.1702006Z T: int, 2025-05-07T20:32:13.1702216Z D: int, 2025-05-07T20:32:13.1702440Z scale_ub: Optional[float], 2025-05-07T20:32:13.1702728Z contiguous: bool, 2025-05-07T20:32:13.1703074Z compiled: bool, 2025-05-07T20:32:13.1703305Z ) -> None: 2025-05-07T20:32:13.1703533Z torch.manual_seed(2025) 2025-05-07T20:32:13.1703790Z 2025-05-07T20:32:13.1704073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.1704425Z 2025-05-07T20:32:13.1704620Z x_sign = torch.sign(x) 2025-05-07T20:32:13.1704924Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.1705249Z x = x_sign * x_clamp 2025-05-07T20:32:13.1705504Z x0 = x[:, :D] 2025-05-07T20:32:13.1705721Z x1 = x[:, D:] 2025-05-07T20:32:13.1705936Z 2025-05-07T20:32:13.1706129Z if contiguous: 2025-05-07T20:32:13.1706359Z x0 = x0.contiguous() 2025-05-07T20:32:13.1706626Z x1 = x1.contiguous() 2025-05-07T20:32:13.1706876Z 2025-05-07T20:32:13.1707066Z if scale_ub is not None: 2025-05-07T20:32:13.1707349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.1707699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.1708008Z ) 2025-05-07T20:32:13.1708205Z else: 2025-05-07T20:32:13.1708424Z scale_ub_tensor = None 2025-05-07T20:32:13.1708677Z 2025-05-07T20:32:13.1708916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.1709291Z op = silu_mul_quant 2025-05-07T20:32:13.1709544Z if compiled: 2025-05-07T20:32:13.1709802Z op = torch.compile(op) 2025-05-07T20:32:13.1710107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1710389Z 2025-05-07T20:32:13.1710582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.1710759Z 2025-05-07T20:32:13.1710863Z moe/activation_test.py:117: 2025-05-07T20:32:13.1711170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1711551Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.1711841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1712463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.1713020Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function>, 'min_dot_size': <function>}
module_map = {'triton.language.extra.libdevice': <module>}
context = <context>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<test instance>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <test instance>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
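All of the failures in this excerpt share one root cause: while lowering the _fbgemm_silu_mul_quant Triton kernel, the compiler rejects the fp8e4nv (FP8 E4M3) element type because this runner's GPU architecture only offers the 'fp8e4b15' and 'fp8e5' variants. The sketch below shows one way a test suite could skip these cases up front instead of failing per-example; it is a hedged sketch, not FBGEMM's actual guard. The helper names are hypothetical, and the compute-capability threshold of 8.9 (Ada/Hopper) is an assumption about Triton's fp8e4nv support, not something stated in this log.

    # Hedged sketch (assumed threshold, hypothetical helper names): skip FP8 E4M3
    # tests on GPUs that predate hardware FP8 support, where Triton raises the
    # fp8e4nv ValueError seen above.
    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton accepts fp8e4nv from compute capability 8.9 up;
        # earlier CUDA architectures only get 'fp8e4b15' and 'fp8e5'.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    # Decorator a unittest-style test like test_silu_mul_quant could apply:
    requires_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU architecture"
    )

With such a guard the whole parameter sweep below would report as skipped once, rather than compiling and failing the same kernel for every sampled example.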
Hypothesis then retried the test with further sampled parameter combinations. Each one failed at the same point, with the identical CompilationError raised from triton/compiler/compiler.py:100; the per-example source listing and traceback are verbatim repeats of the one above:

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError

For the compiled=True examples the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching activation.py:80; the error itself is unchanged.
2025-05-07T20:32:14.5716585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5717278Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5717826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5718518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5719182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5719724Z kernel = self.compile( 2025-05-07T20:32:14.5720361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5721029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5721446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5721691Z 2025-05-07T20:32:14.5721903Z self = 2025-05-07T20:32:14.5722989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5724373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b48040>} 2025-05-07T20:32:14.5725745Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5726788Z context = 2025-05-07T20:32:14.5727082Z 2025-05-07T20:32:14.5727264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5727840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5728314Z module_map=module_map) 2025-05-07T20:32:14.5728691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5729053Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5729316Z E ^ 2025-05-07T20:32:14.5729784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5730241Z 2025-05-07T20:32:14.5730677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5731243Z 2025-05-07T20:32:14.9769645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.9770838Z self=, 2025-05-07T20:32:14.9771689Z T=16384, 2025-05-07T20:32:14.9772006Z D=5120, 2025-05-07T20:32:14.9772205Z scale_ub=1200.0, 2025-05-07T20:32:14.9772446Z contiguous=False, 2025-05-07T20:32:14.9772685Z compiled=False, 2025-05-07T20:32:14.9772899Z ) 2025-05-07T20:32:14.9773239Z self = 2025-05-07T20:32:14.9773756Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.9774041Z 2025-05-07T20:32:14.9774126Z @given( 2025-05-07T20:32:14.9774375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.9774701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.9775046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.9775392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.9775739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.9776037Z ) 2025-05-07T20:32:14.9776395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.9776856Z def test_silu_mul_quant( 2025-05-07T20:32:14.9777111Z self, 2025-05-07T20:32:14.9777315Z T: int, 2025-05-07T20:32:14.9777525Z D: int, 2025-05-07T20:32:14.9777758Z scale_ub: Optional[float], 2025-05-07T20:32:14.9778035Z contiguous: bool, 2025-05-07T20:32:14.9778288Z compiled: bool, 2025-05-07T20:32:14.9778527Z ) -> None: 2025-05-07T20:32:14.9778750Z torch.manual_seed(2025) 2025-05-07T20:32:14.9779005Z 2025-05-07T20:32:14.9779288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.9779644Z 2025-05-07T20:32:14.9780175Z x_sign = torch.sign(x) 2025-05-07T20:32:14.9780490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.9780825Z x = x_sign * x_clamp 2025-05-07T20:32:14.9781162Z x0 = x[:, :D] 2025-05-07T20:32:14.9781391Z x1 = x[:, D:] 2025-05-07T20:32:14.9781602Z 2025-05-07T20:32:14.9781800Z if contiguous: 2025-05-07T20:32:14.9782043Z x0 = x0.contiguous() 2025-05-07T20:32:14.9782309Z x1 = x1.contiguous() 2025-05-07T20:32:14.9782564Z 2025-05-07T20:32:14.9782767Z if scale_ub is not None: 2025-05-07T20:32:14.9783044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.9783390Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.9783711Z ) 2025-05-07T20:32:14.9783907Z else: 2025-05-07T20:32:14.9784130Z scale_ub_tensor = None 2025-05-07T20:32:14.9784392Z 2025-05-07T20:32:14.9784628Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.9784961Z op = silu_mul_quant 2025-05-07T20:32:14.9785223Z if compiled: 2025-05-07T20:32:14.9785475Z op = torch.compile(op) 2025-05-07T20:32:14.9785784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9786068Z 2025-05-07T20:32:14.9786269Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.9786518Z 2025-05-07T20:32:14.9786623Z moe/activation_test.py:117: 2025-05-07T20:32:14.9786933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9787275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.9787561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9788268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.9788964Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.9789521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.9790295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.9790970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.9791516Z kernel = self.compile( 2025-05-07T20:32:14.9792063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.9792726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.9793135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9793369Z 2025-05-07T20:32:14.9793585Z self = 2025-05-07T20:32:14.9794667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.9796079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b488b0>} 2025-05-07T20:32:14.9797417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.9798440Z context = 2025-05-07T20:32:14.9798732Z 2025-05-07T20:32:14.9798911Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.9799437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.9799907Z module_map=module_map) 2025-05-07T20:32:14.9800284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.9800721Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.9800999Z E ^ 2025-05-07T20:32:14.9801466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.9801917Z 2025-05-07T20:32:14.9802351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.9802905Z 2025-05-07T20:32:14.9803012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.9803436Z self=, 2025-05-07T20:32:14.9803846Z T=16384, 2025-05-07T20:32:14.9804041Z D=5120, 2025-05-07T20:32:14.9804242Z scale_ub=1200.0, 2025-05-07T20:32:14.9804474Z contiguous=True, 2025-05-07T20:32:14.9804696Z compiled=True, 2025-05-07T20:32:14.9804907Z ) 2025-05-07T20:32:14.9805235Z self = 2025-05-07T20:32:14.9805743Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.9806032Z 2025-05-07T20:32:14.9806111Z @given( 2025-05-07T20:32:14.9806353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.9806674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.9807030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.9807369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.9807706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.9807993Z ) 2025-05-07T20:32:14.9808351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.9808804Z def test_silu_mul_quant( 2025-05-07T20:32:14.9809048Z self, 2025-05-07T20:32:14.9809250Z T: int, 2025-05-07T20:32:14.9809455Z D: int, 2025-05-07T20:32:14.9809682Z scale_ub: Optional[float], 2025-05-07T20:32:14.9809955Z contiguous: bool, 2025-05-07T20:32:14.9810285Z compiled: bool, 2025-05-07T20:32:14.9810518Z ) -> None: 2025-05-07T20:32:14.9810737Z torch.manual_seed(2025) 2025-05-07T20:32:14.9810988Z 2025-05-07T20:32:14.9811267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.9811616Z 2025-05-07T20:32:14.9811821Z x_sign = torch.sign(x) 2025-05-07T20:32:14.9812127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.9812444Z x = x_sign * x_clamp 2025-05-07T20:32:14.9812694Z x0 = x[:, :D] 2025-05-07T20:32:14.9812924Z x1 = x[:, D:] 2025-05-07T20:32:14.9813133Z 2025-05-07T20:32:14.9813330Z if contiguous: 2025-05-07T20:32:14.9813573Z x0 = x0.contiguous() 2025-05-07T20:32:14.9813835Z x1 = x1.contiguous() 2025-05-07T20:32:14.9814089Z 2025-05-07T20:32:14.9814288Z if scale_ub is not None: 2025-05-07T20:32:14.9814565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.9814922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.9815238Z ) 2025-05-07T20:32:14.9815437Z else: 2025-05-07T20:32:14.9815649Z scale_ub_tensor = None 2025-05-07T20:32:14.9815905Z 2025-05-07T20:32:14.9816145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.9816464Z op = silu_mul_quant 2025-05-07T20:32:14.9816723Z if compiled: 2025-05-07T20:32:14.9816979Z op = torch.compile(op) 2025-05-07T20:32:14.9817278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9817562Z 2025-05-07T20:32:14.9817763Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.9817928Z 2025-05-07T20:32:14.9818031Z moe/activation_test.py:117: 2025-05-07T20:32:14.9818335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9818676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.9818968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9819615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.9820196Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.9820853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.9821626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.9822168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.9822853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.9823523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.9824053Z kernel = self.compile( 2025-05-07T20:32:14.9824599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.9825269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.9825678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9825919Z 2025-05-07T20:32:14.9826129Z self = 2025-05-07T20:32:14.9827261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.9828640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a215e0>} 2025-05-07T20:32:14.9829988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.9831040Z context = 2025-05-07T20:32:14.9831339Z 2025-05-07T20:32:14.9831509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.9832045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.9832524Z module_map=module_map) 2025-05-07T20:32:14.9832893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.9833258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.9833528Z E ^ 2025-05-07T20:32:14.9833995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.9834453Z 2025-05-07T20:32:14.9834867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.9835386Z 2025-05-07T20:32:15.2077554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2078358Z self=, 2025-05-07T20:32:15.2079035Z T=16384, 2025-05-07T20:32:15.2079353Z D=5120, 2025-05-07T20:32:15.2079657Z scale_ub=None, 2025-05-07T20:32:15.2080014Z contiguous=False, 2025-05-07T20:32:15.2080399Z compiled=True, 2025-05-07T20:32:15.2080733Z ) 2025-05-07T20:32:15.2081259Z self = 2025-05-07T20:32:15.2082107Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.2082508Z 2025-05-07T20:32:15.2082616Z @given( 2025-05-07T20:32:15.2082940Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2083377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2083817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2084288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2085180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2085618Z ) 2025-05-07T20:32:15.2086163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2086874Z def test_silu_mul_quant( 2025-05-07T20:32:15.2087227Z self, 2025-05-07T20:32:15.2087521Z T: int, 2025-05-07T20:32:15.2087829Z D: int, 2025-05-07T20:32:15.2088150Z scale_ub: Optional[float], 2025-05-07T20:32:15.2088565Z contiguous: bool, 2025-05-07T20:32:15.2088941Z compiled: bool, 2025-05-07T20:32:15.2089298Z ) -> None: 2025-05-07T20:32:15.2089632Z torch.manual_seed(2025) 2025-05-07T20:32:15.2090008Z 2025-05-07T20:32:15.2090415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2090966Z 2025-05-07T20:32:15.2091286Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2091747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2092246Z x = x_sign * x_clamp 2025-05-07T20:32:15.2092633Z x0 = x[:, :D] 2025-05-07T20:32:15.2092984Z x1 = x[:, D:] 2025-05-07T20:32:15.2093317Z 2025-05-07T20:32:15.2093619Z if contiguous: 2025-05-07T20:32:15.2093997Z x0 = x0.contiguous() 2025-05-07T20:32:15.2094414Z x1 = x1.contiguous() 2025-05-07T20:32:15.2094969Z 2025-05-07T20:32:15.2095296Z if scale_ub is not None: 2025-05-07T20:32:15.2095754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2096322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2096847Z ) 2025-05-07T20:32:15.2097157Z else: 2025-05-07T20:32:15.2097499Z scale_ub_tensor = None 2025-05-07T20:32:15.2097916Z 2025-05-07T20:32:15.2098292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2098828Z op = silu_mul_quant 2025-05-07T20:32:15.2099249Z if compiled: 2025-05-07T20:32:15.2099870Z op = torch.compile(op) 2025-05-07T20:32:15.2100367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2100842Z 2025-05-07T20:32:15.2101299Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2101588Z 2025-05-07T20:32:15.2101749Z moe/activation_test.py:117: 2025-05-07T20:32:15.2102246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2102812Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2103280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2104247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.2105220Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.2106362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2107555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2108499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2109711Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2110859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2111801Z kernel = self.compile( 2025-05-07T20:32:15.2112804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2113947Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2114621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2115017Z 2025-05-07T20:32:15.2115349Z self = 2025-05-07T20:32:15.2117318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2119684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9c405e0>} 2025-05-07T20:32:15.2121994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2123783Z context = 2025-05-07T20:32:15.2124284Z 2025-05-07T20:32:15.2124557Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2125447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2126250Z module_map=module_map) 2025-05-07T20:32:15.2126873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2127448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2127880Z E ^ 2025-05-07T20:32:15.2128668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2129569Z 2025-05-07T20:32:15.2130302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2131203Z 2025-05-07T20:32:15.2131381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2132077Z self=, 2025-05-07T20:32:15.2132770Z T=2048, 2025-05-07T20:32:15.2133074Z D=5120, 2025-05-07T20:32:15.2133389Z scale_ub=None, 2025-05-07T20:32:15.2133728Z contiguous=False, 2025-05-07T20:32:15.2134090Z compiled=True, 2025-05-07T20:32:15.2134423Z ) 2025-05-07T20:32:15.3350086Z self = 2025-05-07T20:32:15.3351289Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.3351755Z 2025-05-07T20:32:15.3351892Z @given( 2025-05-07T20:32:15.3352268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.3352853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.3353334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.3353833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.3354320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.3354787Z ) 2025-05-07T20:32:15.3355387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.3356154Z def test_silu_mul_quant( 2025-05-07T20:32:15.3356559Z self, 2025-05-07T20:32:15.3356876Z T: int, 2025-05-07T20:32:15.3357193Z D: int, 2025-05-07T20:32:15.3357550Z scale_ub: Optional[float], 2025-05-07T20:32:15.3358017Z contiguous: bool, 2025-05-07T20:32:15.3358408Z compiled: bool, 2025-05-07T20:32:15.3358779Z ) -> None: 2025-05-07T20:32:15.3359158Z torch.manual_seed(2025) 2025-05-07T20:32:15.3359556Z 2025-05-07T20:32:15.3360009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.3360598Z 2025-05-07T20:32:15.3360917Z x_sign = torch.sign(x) 2025-05-07T20:32:15.3361389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.3361908Z x = x_sign * x_clamp 2025-05-07T20:32:15.3362356Z x0 = x[:, :D] 2025-05-07T20:32:15.3362700Z x1 = x[:, D:] 2025-05-07T20:32:15.3363043Z 2025-05-07T20:32:15.3363349Z if contiguous: 2025-05-07T20:32:15.3363722Z x0 = x0.contiguous() 2025-05-07T20:32:15.3364153Z x1 = x1.contiguous() 2025-05-07T20:32:15.3364555Z 2025-05-07T20:32:15.3364863Z if scale_ub is not None: 2025-05-07T20:32:15.3365540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.3366115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.3366629Z ) 2025-05-07T20:32:15.3378895Z else: 2025-05-07T20:32:15.3379273Z scale_ub_tensor = None 2025-05-07T20:32:15.3379695Z 2025-05-07T20:32:15.3380083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.3380603Z op = silu_mul_quant 2025-05-07T20:32:15.3381025Z if compiled: 2025-05-07T20:32:15.3381557Z op = torch.compile(op) 2025-05-07T20:32:15.3382030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3382492Z 2025-05-07T20:32:15.3382803Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.3383077Z 2025-05-07T20:32:15.3383245Z moe/activation_test.py:117: 2025-05-07T20:32:15.3383726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3384280Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.3384773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3385738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.3386724Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.3387885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.3389259Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.3390204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.3391404Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.3392582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.3393515Z kernel = self.compile( 2025-05-07T20:32:15.3394466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.3395686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.3396362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3396770Z 2025-05-07T20:32:15.3397117Z self = 2025-05-07T20:32:15.3399043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.3401544Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a21c10>} 2025-05-07T20:32:15.3404038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.3405862Z context = 2025-05-07T20:32:15.3406375Z 2025-05-07T20:32:15.3406653Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.3407563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.3408373Z module_map=module_map) 2025-05-07T20:32:15.3408980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.3409577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.3410016Z E ^ 2025-05-07T20:32:15.3410813Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.3411632Z 2025-05-07T20:32:15.3412368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.3413462Z 2025-05-07T20:32:15.3413638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.3414348Z self=, 2025-05-07T20:32:15.3415035Z T=2048, 2025-05-07T20:32:15.3415346Z D=5120, 2025-05-07T20:32:15.3415671Z scale_ub=1200.0, 2025-05-07T20:32:15.3416033Z contiguous=False, 2025-05-07T20:32:15.3416409Z compiled=True, 2025-05-07T20:32:15.3416750Z ) 2025-05-07T20:32:15.3417277Z self = 2025-05-07T20:32:15.3418136Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:15.3418611Z 2025-05-07T20:32:15.3418748Z @given( 2025-05-07T20:32:15.3419125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.3419653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.3420175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.3420731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.3421360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.3421741Z ) 2025-05-07T20:32:15.3422243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.3422835Z def test_silu_mul_quant( 2025-05-07T20:32:15.3423268Z self, 2025-05-07T20:32:15.3423547Z T: int, 2025-05-07T20:32:15.3423815Z D: int, 2025-05-07T20:32:15.3424125Z scale_ub: Optional[float], 2025-05-07T20:32:15.3424507Z contiguous: bool, 2025-05-07T20:32:15.3424837Z compiled: bool, 2025-05-07T20:32:15.3425159Z ) -> None: 2025-05-07T20:32:15.3425455Z torch.manual_seed(2025) 2025-05-07T20:32:15.3425785Z 2025-05-07T20:32:15.3426152Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.3426661Z 2025-05-07T20:32:15.3426955Z x_sign = torch.sign(x) 2025-05-07T20:32:15.3427385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.3427951Z x = x_sign * x_clamp 2025-05-07T20:32:15.3428310Z x0 = x[:, :D] 2025-05-07T20:32:15.3428636Z x1 = x[:, D:] 2025-05-07T20:32:15.3428960Z 2025-05-07T20:32:15.3429240Z if contiguous: 2025-05-07T20:32:15.3429573Z x0 = x0.contiguous() 2025-05-07T20:32:15.3429973Z x1 = x1.contiguous() 2025-05-07T20:32:15.3430333Z 2025-05-07T20:32:15.3430615Z if scale_ub is not None: 2025-05-07T20:32:15.3431025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.3431510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.3431947Z ) 2025-05-07T20:32:15.3432234Z else: 2025-05-07T20:32:15.3432542Z scale_ub_tensor = None 2025-05-07T20:32:15.3432920Z 2025-05-07T20:32:15.3433269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.3433769Z op = silu_mul_quant 2025-05-07T20:32:15.3434149Z if compiled: 2025-05-07T20:32:15.3434521Z op = torch.compile(op) 2025-05-07T20:32:15.3434993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3435431Z 2025-05-07T20:32:15.3435722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.3435993Z 2025-05-07T20:32:15.3436145Z moe/activation_test.py:117: 2025-05-07T20:32:15.3436620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3437142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.3437588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3438490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.3439392Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.3440684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.3441822Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.3442915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.3444019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.3445098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.3445961Z kernel = self.compile( 2025-05-07T20:32:15.3446833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.3447840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.3448481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3448854Z 2025-05-07T20:32:15.3449190Z self = 2025-05-07T20:32:15.3450995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.3453431Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98e8820>} 2025-05-07T20:32:15.3455854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.3457575Z context = 2025-05-07T20:32:15.3458074Z 2025-05-07T20:32:15.3458348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.3459256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.3460079Z module_map=module_map) 2025-05-07T20:32:15.3460811Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.3461491Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.3461934Z E ^ 2025-05-07T20:32:15.3462741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.3463553Z 2025-05-07T20:32:15.3464290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.3465221Z 2025-05-07T20:32:15.5699072Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5699817Z self=, 2025-05-07T20:32:15.5700482Z T=4096, 2025-05-07T20:32:15.5700781Z D=5120, 2025-05-07T20:32:15.5701265Z scale_ub=1200.0, 2025-05-07T20:32:15.5701632Z contiguous=True, 2025-05-07T20:32:15.5701992Z compiled=True, 2025-05-07T20:32:15.5702343Z ) 2025-05-07T20:32:15.5702865Z self = 2025-05-07T20:32:15.5703690Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.5704071Z 2025-05-07T20:32:15.5704182Z @given( 2025-05-07T20:32:15.5704502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5704959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5705401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5705894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5706392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5706827Z ) 2025-05-07T20:32:15.5707376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5708077Z def test_silu_mul_quant( 2025-05-07T20:32:15.5708424Z self, 2025-05-07T20:32:15.5708716Z T: int, 2025-05-07T20:32:15.5709037Z D: int, 2025-05-07T20:32:15.5710306Z scale_ub: Optional[float], 2025-05-07T20:32:15.5710754Z contiguous: bool, 2025-05-07T20:32:15.5711135Z compiled: bool, 2025-05-07T20:32:15.5711498Z ) -> None: 2025-05-07T20:32:15.5711867Z torch.manual_seed(2025) 2025-05-07T20:32:15.5712326Z 2025-05-07T20:32:15.5712754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5713315Z 2025-05-07T20:32:15.5713618Z x_sign = torch.sign(x) 2025-05-07T20:32:15.5714097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.5714610Z x = x_sign * x_clamp 2025-05-07T20:32:15.5714996Z x0 = x[:, :D] 2025-05-07T20:32:15.5715338Z x1 = x[:, D:] 2025-05-07T20:32:15.5715674Z 2025-05-07T20:32:15.5715975Z if contiguous: 2025-05-07T20:32:15.5716362Z x0 = x0.contiguous() 2025-05-07T20:32:15.5716795Z x1 = x1.contiguous() 2025-05-07T20:32:15.5717192Z 2025-05-07T20:32:15.5717515Z if scale_ub is not None: 2025-05-07T20:32:15.5717977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.5718524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.5719044Z ) 2025-05-07T20:32:15.5719360Z else: 2025-05-07T20:32:15.5719705Z scale_ub_tensor = None 2025-05-07T20:32:15.5720267Z 2025-05-07T20:32:15.5720649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5721195Z op = silu_mul_quant 2025-05-07T20:32:15.5721605Z if compiled: 2025-05-07T20:32:15.5722021Z op = torch.compile(op) 2025-05-07T20:32:15.5722517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5722974Z 2025-05-07T20:32:15.5723289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.5723569Z 2025-05-07T20:32:15.5723740Z moe/activation_test.py:117: 2025-05-07T20:32:15.5724235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5724935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.5725408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5726375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.5727333Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.5728483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.5729698Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.5730630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5731820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5733042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5733967Z kernel = self.compile( 2025-05-07T20:32:15.5734911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5736036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5736698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5737085Z 2025-05-07T20:32:15.5737417Z self = 2025-05-07T20:32:15.5739215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5741912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9ddd430>} 2025-05-07T20:32:15.5744496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5746299Z context = 2025-05-07T20:32:15.5746797Z 2025-05-07T20:32:15.5747081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5747977Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5748764Z module_map=module_map) 2025-05-07T20:32:15.5749368Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5749950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.5750381Z E ^ 2025-05-07T20:32:15.5751180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5751978Z 2025-05-07T20:32:15.5752717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5753637Z 2025-05-07T20:32:15.5753808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5754513Z self=, 2025-05-07T20:32:15.5755179Z T=128, 2025-05-07T20:32:15.5755591Z D=5120, 2025-05-07T20:32:15.5755903Z scale_ub=1200.0, 2025-05-07T20:32:15.5756266Z contiguous=False, 2025-05-07T20:32:15.5756629Z compiled=True, 2025-05-07T20:32:15.5756968Z ) 2025-05-07T20:32:15.9157143Z self = 2025-05-07T20:32:15.9158052Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:15.9158515Z 2025-05-07T20:32:15.9158654Z @given( 2025-05-07T20:32:15.9159028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9159559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9160066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9160887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9161408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9161888Z ) 2025-05-07T20:32:15.9162472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9163245Z def test_silu_mul_quant( 2025-05-07T20:32:15.9163644Z self, 2025-05-07T20:32:15.9163950Z T: int, 2025-05-07T20:32:15.9164276Z D: int, 2025-05-07T20:32:15.9164632Z scale_ub: Optional[float], 2025-05-07T20:32:15.9165084Z contiguous: bool, 2025-05-07T20:32:15.9165470Z compiled: bool, 2025-05-07T20:32:15.9165841Z ) -> None: 2025-05-07T20:32:15.9166190Z torch.manual_seed(2025) 2025-05-07T20:32:15.9166585Z 2025-05-07T20:32:15.9167031Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9167619Z 2025-05-07T20:32:15.9167928Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9168422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9168947Z x = x_sign * x_clamp 2025-05-07T20:32:15.9169334Z x0 = x[:, :D] 2025-05-07T20:32:15.9169691Z x1 = x[:, D:] 2025-05-07T20:32:15.9170030Z 2025-05-07T20:32:15.9170325Z if contiguous: 2025-05-07T20:32:15.9170715Z x0 = x0.contiguous() 2025-05-07T20:32:15.9171145Z x1 = x1.contiguous() 2025-05-07T20:32:15.9171537Z 2025-05-07T20:32:15.9171852Z if scale_ub is not None: 2025-05-07T20:32:15.9172304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9172906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9173425Z ) 2025-05-07T20:32:15.9173740Z else: 2025-05-07T20:32:15.9174082Z scale_ub_tensor = None 2025-05-07T20:32:15.9174501Z 2025-05-07T20:32:15.9174876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9175633Z op = silu_mul_quant 2025-05-07T20:32:15.9176051Z if compiled: 2025-05-07T20:32:15.9176458Z op = torch.compile(op) 2025-05-07T20:32:15.9176951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9177405Z 2025-05-07T20:32:15.9177720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9178001Z 2025-05-07T20:32:15.9178171Z moe/activation_test.py:117: 2025-05-07T20:32:15.9178657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9179221Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9179689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9180647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.9181739Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.9182889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9184112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9184990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9186151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9187392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9188300Z kernel = self.compile( 2025-05-07T20:32:15.9189196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9190300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9190963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9191356Z 2025-05-07T20:32:15.9191708Z self = 2025-05-07T20:32:15.9193631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9196216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982b040>} 2025-05-07T20:32:15.9198639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9200461Z context = 2025-05-07T20:32:15.9200966Z 2025-05-07T20:32:15.9201251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9202147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9202979Z module_map=module_map) 2025-05-07T20:32:15.9203595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9204181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9204619Z E ^ 2025-05-07T20:32:15.9205404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9206212Z 2025-05-07T20:32:15.9206954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9207863Z 2025-05-07T20:32:15.9208031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9208736Z self=, 2025-05-07T20:32:15.9209429Z T=16384, 2025-05-07T20:32:15.9209740Z D=7168, 2025-05-07T20:32:15.9210058Z scale_ub=1200.0, 2025-05-07T20:32:15.9210427Z contiguous=True, 2025-05-07T20:32:15.9210908Z compiled=True, 2025-05-07T20:32:15.9211255Z ) 2025-05-07T20:32:15.9211790Z self = 2025-05-07T20:32:15.9212647Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.9213124Z 2025-05-07T20:32:15.9213248Z @given( 2025-05-07T20:32:15.9213627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9214155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9214661Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9215225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9215783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9216259Z ) 2025-05-07T20:32:15.9216857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9217619Z def test_silu_mul_quant( 2025-05-07T20:32:15.9218011Z self, 2025-05-07T20:32:15.9218330Z T: int, 2025-05-07T20:32:15.9218667Z D: int, 2025-05-07T20:32:15.9219030Z scale_ub: Optional[float], 2025-05-07T20:32:15.9219477Z contiguous: bool, 2025-05-07T20:32:15.9219878Z compiled: bool, 2025-05-07T20:32:15.9220245Z ) -> None: 2025-05-07T20:32:15.9220588Z torch.manual_seed(2025) 2025-05-07T20:32:15.9221149Z 2025-05-07T20:32:15.9221603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9222174Z 2025-05-07T20:32:15.9222490Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9222974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9223489Z x = x_sign * x_clamp 2025-05-07T20:32:15.9223892Z x0 = x[:, :D] 2025-05-07T20:32:15.9224249Z x1 = x[:, D:] 2025-05-07T20:32:15.9224584Z 2025-05-07T20:32:15.9224891Z if contiguous: 2025-05-07T20:32:15.9225278Z x0 = x0.contiguous() 2025-05-07T20:32:15.9225703Z x1 = x1.contiguous() 2025-05-07T20:32:15.9226186Z 2025-05-07T20:32:15.9226506Z if scale_ub is not None: 2025-05-07T20:32:15.9226963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9227522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9228009Z ) 2025-05-07T20:32:15.9228308Z else: 2025-05-07T20:32:15.9228586Z scale_ub_tensor = None 2025-05-07T20:32:15.9228930Z 2025-05-07T20:32:15.9229255Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9229666Z op = silu_mul_quant 2025-05-07T20:32:15.9230016Z if compiled: 2025-05-07T20:32:15.9230358Z op = torch.compile(op) 2025-05-07T20:32:15.9230764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9231148Z 2025-05-07T20:32:15.9231413Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9231647Z 2025-05-07T20:32:15.9231785Z moe/activation_test.py:117: 2025-05-07T20:32:15.9232193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9232648Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9233066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9233879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.9234686Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.9235661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9236660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9237459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9238472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9239439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9240679Z kernel = self.compile( 2025-05-07T20:32:15.9241534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9242554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9243180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9243586Z 2025-05-07T20:32:15.9243910Z self = 2025-05-07T20:32:15.9245681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9247961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982bb80>} 2025-05-07T20:32:15.9250173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9251847Z context = 2025-05-07T20:32:15.9252424Z 2025-05-07T20:32:15.9252685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9253528Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9254272Z module_map=module_map) 2025-05-07T20:32:15.9254845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9255400Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9255802Z E ^ 2025-05-07T20:32:15.9256552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9257296Z 2025-05-07T20:32:15.9258091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9258925Z 2025-05-07T20:32:16.2018284Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2036191Z self=, 2025-05-07T20:32:16.2036939Z T=16384, 2025-05-07T20:32:16.2037262Z D=5120, 2025-05-07T20:32:16.2037571Z scale_ub=1200.0, 2025-05-07T20:32:16.2037946Z contiguous=True, 2025-05-07T20:32:16.2038318Z compiled=False, 2025-05-07T20:32:16.2038646Z ) 2025-05-07T20:32:16.2039181Z self = 2025-05-07T20:32:16.2040037Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.2040832Z 2025-05-07T20:32:16.2040973Z @given( 2025-05-07T20:32:16.2041352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2041889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.2042430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.2042996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.2043561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.2044051Z ) 2025-05-07T20:32:16.2044652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.2045428Z def test_silu_mul_quant( 2025-05-07T20:32:16.2045844Z self, 2025-05-07T20:32:16.2046154Z T: int, 2025-05-07T20:32:16.2046479Z D: int, 2025-05-07T20:32:16.2046840Z scale_ub: Optional[float], 2025-05-07T20:32:16.2047291Z contiguous: bool, 2025-05-07T20:32:16.2047696Z compiled: bool, 2025-05-07T20:32:16.2048072Z ) -> None: 2025-05-07T20:32:16.2048427Z torch.manual_seed(2025) 2025-05-07T20:32:16.2048829Z 2025-05-07T20:32:16.2049284Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2049878Z 2025-05-07T20:32:16.2050598Z x_sign = torch.sign(x) 2025-05-07T20:32:16.2051108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.2051645Z x = x_sign * x_clamp 2025-05-07T20:32:16.2052045Z x0 = x[:, :D] 2025-05-07T20:32:16.2052404Z x1 = x[:, D:] 2025-05-07T20:32:16.2052764Z 2025-05-07T20:32:16.2053063Z if contiguous: 2025-05-07T20:32:16.2053456Z x0 = x0.contiguous() 2025-05-07T20:32:16.2053898Z x1 = x1.contiguous() 2025-05-07T20:32:16.2054290Z 2025-05-07T20:32:16.2054599Z if scale_ub is not None: 2025-05-07T20:32:16.2055066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.2055631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.2056161Z ) 2025-05-07T20:32:16.2056480Z else: 2025-05-07T20:32:16.2056804Z scale_ub_tensor = None 2025-05-07T20:32:16.2057227Z 2025-05-07T20:32:16.2057615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.2058135Z op = silu_mul_quant 2025-05-07T20:32:16.2058525Z if compiled: 2025-05-07T20:32:16.2058925Z op = torch.compile(op) 2025-05-07T20:32:16.2059378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2059826Z 2025-05-07T20:32:16.2060256Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.2060509Z 2025-05-07T20:32:16.2060675Z moe/activation_test.py:117: 2025-05-07T20:32:16.2061262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2061811Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.2062278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2063434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.2064614Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.2065549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.2066877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.2068034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.2068966Z kernel = self.compile( 2025-05-07T20:32:16.2069880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.2071019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.2071686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2072099Z 2025-05-07T20:32:16.2072489Z self = 2025-05-07T20:32:16.2074403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.2076853Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97835e0>} 2025-05-07T20:32:16.2079248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.2081058Z context = 2025-05-07T20:32:16.2081570Z 2025-05-07T20:32:16.2081846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.2082721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.2083518Z module_map=module_map) 2025-05-07T20:32:16.2084263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.2084868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.2085310Z E ^ 2025-05-07T20:32:16.2086103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.2086909Z 2025-05-07T20:32:16.2087640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.2088536Z 2025-05-07T20:32:16.2088714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2089402Z self=, 2025-05-07T20:32:16.2090090Z T=1, 2025-05-07T20:32:16.2090398Z D=7168, 2025-05-07T20:32:16.2090717Z scale_ub=1200.0, 2025-05-07T20:32:16.2091077Z contiguous=False, 2025-05-07T20:32:16.2091453Z compiled=False, 2025-05-07T20:32:16.2091794Z ) 2025-05-07T20:32:16.2092336Z self = 2025-05-07T20:32:16.2093165Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.2093608Z 2025-05-07T20:32:16.2093746Z @given( 2025-05-07T20:32:16.2094113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2094639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.2095239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.2095792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.2096339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.2096815Z ) 2025-05-07T20:32:16.2097402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.2098150Z def test_silu_mul_quant( 2025-05-07T20:32:16.2098559Z self, 2025-05-07T20:32:16.2098879Z T: int, 2025-05-07T20:32:16.2099213Z D: int, 2025-05-07T20:32:16.2099574Z scale_ub: Optional[float], 2025-05-07T20:32:16.2100123Z contiguous: bool, 2025-05-07T20:32:16.2100514Z compiled: bool, 2025-05-07T20:32:16.2100887Z ) -> None: 2025-05-07T20:32:16.2101314Z torch.manual_seed(2025) 2025-05-07T20:32:16.2101692Z 2025-05-07T20:32:16.2102131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2102711Z 2025-05-07T20:32:16.2103022Z x_sign = torch.sign(x) 2025-05-07T20:32:16.2103521Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.2104040Z x = x_sign * x_clamp 2025-05-07T20:32:16.2104448Z x0 = x[:, :D] 2025-05-07T20:32:16.2104808Z x1 = x[:, D:] 2025-05-07T20:32:16.2105150Z 2025-05-07T20:32:16.2105443Z if contiguous: 2025-05-07T20:32:16.2105826Z x0 = x0.contiguous() 2025-05-07T20:32:16.2106262Z x1 = x1.contiguous() 2025-05-07T20:32:16.2106656Z 2025-05-07T20:32:16.2106968Z if scale_ub is not None: 2025-05-07T20:32:16.2107433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.2107989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.2108519Z ) 2025-05-07T20:32:16.2108836Z else: 2025-05-07T20:32:16.2109171Z scale_ub_tensor = None 2025-05-07T20:32:16.2109594Z 2025-05-07T20:32:16.2109975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.2110501Z op = silu_mul_quant 2025-05-07T20:32:16.2110921Z if compiled: 2025-05-07T20:32:16.2111328Z op = torch.compile(op) 2025-05-07T20:32:16.2111808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2112258Z 2025-05-07T20:32:16.2112564Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.2112834Z 2025-05-07T20:32:16.2113006Z moe/activation_test.py:117: 2025-05-07T20:32:16.2113470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2114014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.2114635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2115817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.2117068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.2117910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.2118939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.2120038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.2120971Z kernel = self.compile( 2025-05-07T20:32:16.2121910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.2123041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.2123731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2124138Z 2025-05-07T20:32:16.2124485Z self = 2025-05-07T20:32:16.2126400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.2128939Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97839d0>} 2025-05-07T20:32:16.2131309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.2133171Z context = 2025-05-07T20:32:16.2133664Z 2025-05-07T20:32:16.2134027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.2134931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.2135729Z module_map=module_map) 2025-05-07T20:32:16.2136343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.2136936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.2137362Z E ^ 2025-05-07T20:32:16.2138168Z E ValueError("type fp8e4nv not supported in this architecture. 
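Every CompilationError in this run has the same root cause: the kernel asks Triton for the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), and Triton only compiles that type on GPUs of compute capability 8.9 or newer (Ada and Hopper). The linux.g5.4xlarge runner carries an NVIDIA A10G at compute capability 8.6, which is why the message below offers only fp8e4b15 and fp8e5. A minimal guard, sketched with an illustrative helper name that is not part of FBGEMM, would skip the fp8 tests on such parts:

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv needs compute capability >= (8, 9); the A10G on this
        # runner reports (8, 6), so the suite would be skipped there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationFP8Test(unittest.TestCase):
        ...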
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.2138971Z 2025-05-07T20:32:16.2139711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.2140916Z 2025-05-07T20:32:16.2141149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2141848Z self=, 2025-05-07T20:32:16.2142511Z T=4096, 2025-05-07T20:32:16.2142815Z D=7168, 2025-05-07T20:32:16.2143120Z scale_ub=1200.0, 2025-05-07T20:32:16.2143482Z contiguous=False, 2025-05-07T20:32:16.2143847Z compiled=True, 2025-05-07T20:32:16.2144170Z ) 2025-05-07T20:32:16.3312826Z self = 2025-05-07T20:32:16.3313774Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3314259Z 2025-05-07T20:32:16.3314383Z @given( 2025-05-07T20:32:16.3314767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3315299Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3315793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3316320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3316807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3317253Z ) 2025-05-07T20:32:16.3318203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3318992Z def test_silu_mul_quant( 2025-05-07T20:32:16.3319382Z self, 2025-05-07T20:32:16.3319726Z T: int, 2025-05-07T20:32:16.3320044Z D: int, 2025-05-07T20:32:16.3320402Z scale_ub: Optional[float], 2025-05-07T20:32:16.3320857Z contiguous: bool, 2025-05-07T20:32:16.3321249Z compiled: bool, 2025-05-07T20:32:16.3321619Z ) -> None: 2025-05-07T20:32:16.3321966Z torch.manual_seed(2025) 2025-05-07T20:32:16.3322358Z 2025-05-07T20:32:16.3322805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3323385Z 2025-05-07T20:32:16.3323695Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3324162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3324680Z x = x_sign * x_clamp 2025-05-07T20:32:16.3325079Z x0 = x[:, :D] 2025-05-07T20:32:16.3325429Z x1 = x[:, D:] 2025-05-07T20:32:16.3325775Z 2025-05-07T20:32:16.3326089Z if contiguous: 2025-05-07T20:32:16.3326468Z x0 = x0.contiguous() 2025-05-07T20:32:16.3326899Z x1 = x1.contiguous() 2025-05-07T20:32:16.3327299Z 2025-05-07T20:32:16.3327607Z if scale_ub is not None: 2025-05-07T20:32:16.3328062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3328749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3329259Z ) 2025-05-07T20:32:16.3329576Z else: 2025-05-07T20:32:16.3329918Z scale_ub_tensor = None 2025-05-07T20:32:16.3330330Z 2025-05-07T20:32:16.3330711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3331238Z op = silu_mul_quant 2025-05-07T20:32:16.3331657Z if compiled: 2025-05-07T20:32:16.3332065Z op = torch.compile(op) 2025-05-07T20:32:16.3332603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3333071Z 2025-05-07T20:32:16.3333517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3333804Z 2025-05-07T20:32:16.3333965Z moe/activation_test.py:117: 2025-05-07T20:32:16.3334463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3335014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3335497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3336465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3337434Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3338579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3339785Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3340985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3342261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3343397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3344308Z kernel = self.compile( 2025-05-07T20:32:16.3345212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3346318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3346979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3347374Z 2025-05-07T20:32:16.3347730Z self = 2025-05-07T20:32:16.3349658Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3352342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9696c10>} 2025-05-07T20:32:16.3354807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3356631Z context = 2025-05-07T20:32:16.3357130Z 2025-05-07T20:32:16.3357413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3358298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3359116Z module_map=module_map) 2025-05-07T20:32:16.3359728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3360318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3360748Z E ^ 2025-05-07T20:32:16.3361550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3362360Z 2025-05-07T20:32:16.3363101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3364118Z 2025-05-07T20:32:16.3364288Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3364993Z self=, 2025-05-07T20:32:16.3365690Z T=128, 2025-05-07T20:32:16.3365999Z D=7168, 2025-05-07T20:32:16.3366310Z scale_ub=1200.0, 2025-05-07T20:32:16.3366682Z contiguous=False, 2025-05-07T20:32:16.3367054Z compiled=True, 2025-05-07T20:32:16.3367388Z ) 2025-05-07T20:32:16.3367930Z self = 2025-05-07T20:32:16.3368775Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3369351Z 2025-05-07T20:32:16.3369482Z @given( 2025-05-07T20:32:16.3369852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3370383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3370892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3371456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3372026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3372513Z ) 2025-05-07T20:32:16.3373101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3373866Z def test_silu_mul_quant( 2025-05-07T20:32:16.3374273Z self, 2025-05-07T20:32:16.3374584Z T: int, 2025-05-07T20:32:16.3374917Z D: int, 2025-05-07T20:32:16.3375271Z scale_ub: Optional[float], 2025-05-07T20:32:16.3375714Z contiguous: bool, 2025-05-07T20:32:16.3376111Z compiled: bool, 2025-05-07T20:32:16.3376479Z ) -> None: 2025-05-07T20:32:16.3376833Z torch.manual_seed(2025) 2025-05-07T20:32:16.3377236Z 2025-05-07T20:32:16.3377685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3378260Z 2025-05-07T20:32:16.3378573Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3379056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3379584Z x = x_sign * x_clamp 2025-05-07T20:32:16.3379976Z x0 = x[:, :D] 2025-05-07T20:32:16.3380333Z x1 = x[:, D:] 2025-05-07T20:32:16.3380680Z 2025-05-07T20:32:16.3380978Z if contiguous: 2025-05-07T20:32:16.3381466Z x0 = x0.contiguous() 2025-05-07T20:32:16.3381896Z x1 = x1.contiguous() 2025-05-07T20:32:16.3382296Z 2025-05-07T20:32:16.3382611Z if scale_ub is not None: 2025-05-07T20:32:16.3383122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3383658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3384151Z ) 2025-05-07T20:32:16.3384534Z else: 2025-05-07T20:32:16.3384819Z scale_ub_tensor = None 2025-05-07T20:32:16.3385165Z 2025-05-07T20:32:16.3385492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3385920Z op = silu_mul_quant 2025-05-07T20:32:16.3386268Z if compiled: 2025-05-07T20:32:16.3386604Z op = torch.compile(op) 2025-05-07T20:32:16.3387000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3387385Z 2025-05-07T20:32:16.3387655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3387885Z 2025-05-07T20:32:16.3388030Z moe/activation_test.py:117: 2025-05-07T20:32:16.3388443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3388903Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3389326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3390163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3390984Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3391969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3392987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3393870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3394937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3396013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3396873Z kernel = self.compile( 2025-05-07T20:32:16.3397760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3398893Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3399715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3400107Z 2025-05-07T20:32:16.3400460Z self = 2025-05-07T20:32:16.3402295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3404691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98a9820>} 2025-05-07T20:32:16.3407088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3408910Z context = 2025-05-07T20:32:16.3409431Z 2025-05-07T20:32:16.3409708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3410616Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3411431Z module_map=module_map) 2025-05-07T20:32:16.3412037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3412636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3413077Z E ^ 2025-05-07T20:32:16.3413886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3414696Z 2025-05-07T20:32:16.3415428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3416347Z 2025-05-07T20:32:16.5092428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5093281Z self=, 2025-05-07T20:32:16.5093724Z T=2048, 2025-05-07T20:32:16.5093954Z D=7168, 2025-05-07T20:32:16.5094149Z scale_ub=None, 2025-05-07T20:32:16.5094373Z contiguous=True, 2025-05-07T20:32:16.5094607Z compiled=True, 2025-05-07T20:32:16.5094822Z ) 2025-05-07T20:32:16.5095156Z self = 2025-05-07T20:32:16.5095671Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5095943Z 2025-05-07T20:32:16.5096032Z @given( 2025-05-07T20:32:16.5096264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5096589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5096907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5097239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5097582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5097873Z ) 2025-05-07T20:32:16.5098235Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5098686Z def test_silu_mul_quant( 2025-05-07T20:32:16.5098938Z self, 2025-05-07T20:32:16.5099133Z T: int, 2025-05-07T20:32:16.5099338Z D: int, 2025-05-07T20:32:16.5099567Z scale_ub: Optional[float], 2025-05-07T20:32:16.5099921Z contiguous: bool, 2025-05-07T20:32:16.5100162Z compiled: bool, 2025-05-07T20:32:16.5100399Z ) -> None: 2025-05-07T20:32:16.5100622Z torch.manual_seed(2025) 2025-05-07T20:32:16.5100866Z 2025-05-07T20:32:16.5101242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5101595Z 2025-05-07T20:32:16.5101789Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5102088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5102406Z x = x_sign * x_clamp 2025-05-07T20:32:16.5102651Z x0 = x[:, :D] 2025-05-07T20:32:16.5102968Z x1 = x[:, D:] 2025-05-07T20:32:16.5103183Z 2025-05-07T20:32:16.5103369Z if contiguous: 2025-05-07T20:32:16.5103610Z x0 = x0.contiguous() 2025-05-07T20:32:16.5103878Z x1 = x1.contiguous() 2025-05-07T20:32:16.5104122Z 2025-05-07T20:32:16.5104319Z if scale_ub is not None: 2025-05-07T20:32:16.5104604Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5104952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5113510Z ) 2025-05-07T20:32:16.5113746Z else: 2025-05-07T20:32:16.5113969Z scale_ub_tensor = None 2025-05-07T20:32:16.5114235Z 2025-05-07T20:32:16.5114476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5114810Z op = silu_mul_quant 2025-05-07T20:32:16.5115077Z if compiled: 2025-05-07T20:32:16.5115328Z op = torch.compile(op) 2025-05-07T20:32:16.5115640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5115939Z 2025-05-07T20:32:16.5116135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5116311Z 2025-05-07T20:32:16.5116417Z moe/activation_test.py:117: 2025-05-07T20:32:16.5116728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5117078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5117372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5117950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5118524Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5119183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5119887Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5120434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5121243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5121923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5122467Z kernel = self.compile( 2025-05-07T20:32:16.5123018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5123682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5124082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5124323Z 2025-05-07T20:32:16.5124535Z self = 2025-05-07T20:32:16.5125804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5127206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97b54c0>} 2025-05-07T20:32:16.5128559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5129645Z context = 2025-05-07T20:32:16.5129943Z 2025-05-07T20:32:16.5130118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5130650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5131114Z module_map=module_map) 2025-05-07T20:32:16.5131493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5131854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5132162Z E ^ 2025-05-07T20:32:16.5132639Z E ValueError("type fp8e4nv not supported in this architecture. 
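For orientation, the op under test takes the two D-wide halves of a bf16 activation and returns an fp8 tensor plus a scale. A rough eager-mode sketch of those semantics, assuming a SiLU-gated product with rowwise fp8 quantization and an optional upper bound on the scale (the real FBGEMM Triton kernel is not shown in this log), looks like:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32 for accuracy, then rowwise quantization to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max
        return (y / scale).to(torch.float8_e4m3fn), scale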
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5133104Z 2025-05-07T20:32:16.5133520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5134043Z 2025-05-07T20:32:16.5134154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5134565Z self=, 2025-05-07T20:32:16.5134975Z T=16384, 2025-05-07T20:32:16.5135174Z D=5120, 2025-05-07T20:32:16.5135366Z scale_ub=None, 2025-05-07T20:32:16.5135594Z contiguous=False, 2025-05-07T20:32:16.5135829Z compiled=False, 2025-05-07T20:32:16.5136046Z ) 2025-05-07T20:32:16.5136365Z self = 2025-05-07T20:32:16.5136875Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5137163Z 2025-05-07T20:32:16.5137252Z @given( 2025-05-07T20:32:16.5137482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5137808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5138127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5138467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5138805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5139099Z ) 2025-05-07T20:32:16.5139456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5139907Z def test_silu_mul_quant( 2025-05-07T20:32:16.5140535Z self, 2025-05-07T20:32:16.5140801Z T: int, 2025-05-07T20:32:16.5141129Z D: int, 2025-05-07T20:32:16.5141424Z scale_ub: Optional[float], 2025-05-07T20:32:16.5141787Z contiguous: bool, 2025-05-07T20:32:16.5142052Z compiled: bool, 2025-05-07T20:32:16.5142452Z ) -> None: 2025-05-07T20:32:16.5142680Z torch.manual_seed(2025) 2025-05-07T20:32:16.5142924Z 2025-05-07T20:32:16.5143207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5143557Z 2025-05-07T20:32:16.5143750Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5144054Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5146091Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5147954Z 2025-05-07T20:32:16.5148078Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.5148302Z 2025-05-07T20:32:16.5148408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5148841Z self=, 2025-05-07T20:32:16.5149241Z T=4096, 2025-05-07T20:32:16.5149502Z D=7168, 2025-05-07T20:32:16.5149704Z scale_ub=1200.0, 2025-05-07T20:32:16.5149929Z contiguous=True, 2025-05-07T20:32:16.5150157Z compiled=True, 2025-05-07T20:32:16.5150367Z ) 2025-05-07T20:32:16.5150684Z self = 2025-05-07T20:32:16.5151183Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5151455Z 2025-05-07T20:32:16.5151544Z @given( 2025-05-07T20:32:16.5151773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5152097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5152490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5152829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153457Z ) 2025-05-07T20:32:16.5153813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5154258Z def test_silu_mul_quant( 2025-05-07T20:32:16.5154510Z self, 2025-05-07T20:32:16.5154714Z T: int, 2025-05-07T20:32:16.5154913Z D: int, 2025-05-07T20:32:16.5155143Z scale_ub: Optional[float], 2025-05-07T20:32:16.5155426Z contiguous: bool, 2025-05-07T20:32:16.5155667Z compiled: bool, 2025-05-07T20:32:16.5155898Z ) -> None: 2025-05-07T20:32:16.5156124Z torch.manual_seed(2025) 2025-05-07T20:32:16.5156369Z 2025-05-07T20:32:16.5156647Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5156999Z 2025-05-07T20:32:16.5157209Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5157500Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5159486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5161372Z 2025-05-07T20:32:16.5161495Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.5161711Z 2025-05-07T20:32:16.5161823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5162240Z self=, 2025-05-07T20:32:16.5162774Z T=16384, 2025-05-07T20:32:16.5162977Z D=7168, 2025-05-07T20:32:16.5163173Z scale_ub=None, 2025-05-07T20:32:16.5163398Z contiguous=False, 2025-05-07T20:32:16.5163630Z compiled=False, 2025-05-07T20:32:16.5163839Z ) 2025-05-07T20:32:16.6211437Z self = 2025-05-07T20:32:16.6212192Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6212482Z 2025-05-07T20:32:16.6212574Z @given( 2025-05-07T20:32:16.6212816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6213143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6213465Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6213804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6214145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6214449Z ) 2025-05-07T20:32:16.6214818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6215286Z def test_silu_mul_quant( 2025-05-07T20:32:16.6215543Z self, 2025-05-07T20:32:16.6215752Z T: int, 2025-05-07T20:32:16.6215955Z D: int, 2025-05-07T20:32:16.6216186Z scale_ub: Optional[float], 2025-05-07T20:32:16.6216748Z contiguous: bool, 2025-05-07T20:32:16.6216996Z compiled: bool, 2025-05-07T20:32:16.6217236Z ) -> None: 2025-05-07T20:32:16.6217465Z torch.manual_seed(2025) 2025-05-07T20:32:16.6217712Z 2025-05-07T20:32:16.6217997Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6220056Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6222188Z 2025-05-07T20:32:16.6222322Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6222543Z 2025-05-07T20:32:16.6222656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6223077Z self=, 2025-05-07T20:32:16.6223487Z T=2048, 2025-05-07T20:32:16.6223688Z D=7168, 2025-05-07T20:32:16.6223884Z scale_ub=1200.0, 2025-05-07T20:32:16.6224119Z contiguous=True, 2025-05-07T20:32:16.6224352Z compiled=True, 2025-05-07T20:32:16.6224562Z ) 2025-05-07T20:32:16.6224893Z self = 2025-05-07T20:32:16.6225395Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6225682Z 2025-05-07T20:32:16.6225764Z @given( 2025-05-07T20:32:16.6226006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6226328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6226645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6226982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6227324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6227620Z ) 2025-05-07T20:32:16.6227974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6228425Z def test_silu_mul_quant( 2025-05-07T20:32:16.6228681Z self, 2025-05-07T20:32:16.6228881Z T: int, 2025-05-07T20:32:16.6229091Z D: int, 2025-05-07T20:32:16.6229323Z scale_ub: Optional[float], 2025-05-07T20:32:16.6229603Z contiguous: bool, 2025-05-07T20:32:16.6229855Z compiled: bool, 2025-05-07T20:32:16.6230090Z ) -> None: 2025-05-07T20:32:16.6230453Z torch.manual_seed(2025) 2025-05-07T20:32:16.6230715Z 2025-05-07T20:32:16.6230997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6231348Z 2025-05-07T20:32:16.6231547Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6231851Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6233828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
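The OutOfMemoryError request sizes match the test's own shapes exactly: x is [T, 2*D] in bfloat16 at 2 bytes per element, and each elementwise step (abs, clamp, sign, the product) materializes one more tensor of that size. Checking the four examples above:

    # one [T, 2*D] bfloat16 tensor, in MiB
    for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
        print(T, D, T * 2 * D * 2 / 2**20)
    # 16384 5120 320.0   -> the 320.00 MiB request
    # 4096  7168 112.0   -> the 112.00 MiB request
    # 16384 7168 448.0   -> the 448.00 MiB request
    # 2048  7168 56.0    -> the 56.00 MiB request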
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6235648Z 2025-05-07T20:32:16.6235793Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.6236009Z 2025-05-07T20:32:16.6236115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6236539Z self=, 2025-05-07T20:32:16.6236945Z T=2048, 2025-05-07T20:32:16.6237208Z D=7168, 2025-05-07T20:32:16.6237404Z scale_ub=None, 2025-05-07T20:32:16.6237627Z contiguous=True, 2025-05-07T20:32:16.6237861Z compiled=False, 2025-05-07T20:32:16.6238070Z ) 2025-05-07T20:32:16.6238395Z self = 2025-05-07T20:32:16.6238894Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6239166Z 2025-05-07T20:32:16.6239248Z @given( 2025-05-07T20:32:16.6239503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6239829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6240486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6240935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6241283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6241578Z ) 2025-05-07T20:32:16.6241936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6242390Z def test_silu_mul_quant( 2025-05-07T20:32:16.6242634Z self, 2025-05-07T20:32:16.6242840Z T: int, 2025-05-07T20:32:16.6243048Z D: int, 2025-05-07T20:32:16.6243270Z scale_ub: Optional[float], 2025-05-07T20:32:16.6243554Z contiguous: bool, 2025-05-07T20:32:16.6243805Z compiled: bool, 2025-05-07T20:32:16.6244033Z ) -> None: 2025-05-07T20:32:16.6244262Z torch.manual_seed(2025) 2025-05-07T20:32:16.6244516Z 2025-05-07T20:32:16.6244795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245145Z 2025-05-07T20:32:16.6245346Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.6247260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
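Note how little headroom is left by this point: a 22.07 GiB device with under 30 MiB free, so even a 56 MiB request fails. Each failing Hypothesis example leaves its tensors to be reclaimed lazily while the next example allocates immediately. Two standard mitigations, sketched on the assumption that the suite can run setup code before CUDA initializes plus a hook between examples:

    import gc
    import os

    # Must be set before the first CUDA allocation (e.g. in conftest.py),
    # otherwise it has no effect; this is the setting the error text suggests.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references, then return cached blocks to the driver
        # so the next example starts against an empty caching allocator.
        gc.collect()
        torch.cuda.empty_cache()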
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6249119Z 2025-05-07T20:32:16.6249245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.6249461Z 2025-05-07T20:32:16.6249568Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6249988Z self=, 2025-05-07T20:32:16.6250397Z T=1, 2025-05-07T20:32:16.6250590Z D=7168, 2025-05-07T20:32:16.6250784Z scale_ub=1200.0, 2025-05-07T20:32:16.6251141Z contiguous=True, 2025-05-07T20:32:16.6251375Z compiled=False, 2025-05-07T20:32:16.6251582Z ) 2025-05-07T20:32:16.7821031Z self = 2025-05-07T20:32:16.7821943Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.7822335Z 2025-05-07T20:32:16.7822461Z @given( 2025-05-07T20:32:16.7822740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7823062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7823389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7823740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7824082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7824386Z ) 2025-05-07T20:32:16.7824755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7825205Z def test_silu_mul_quant( 2025-05-07T20:32:16.7825480Z self, 2025-05-07T20:32:16.7825690Z T: int, 2025-05-07T20:32:16.7825891Z D: int, 2025-05-07T20:32:16.7826125Z scale_ub: Optional[float], 2025-05-07T20:32:16.7826415Z contiguous: bool, 2025-05-07T20:32:16.7826666Z compiled: bool, 2025-05-07T20:32:16.7827092Z ) -> None: 2025-05-07T20:32:16.7827324Z torch.manual_seed(2025) 2025-05-07T20:32:16.7827573Z 2025-05-07T20:32:16.7827863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7828222Z 2025-05-07T20:32:16.7828429Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7828733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7829064Z x = x_sign * x_clamp 2025-05-07T20:32:16.7829321Z x0 = x[:, :D] 2025-05-07T20:32:16.7829548Z x1 = x[:, D:] 2025-05-07T20:32:16.7829775Z 2025-05-07T20:32:16.7829975Z if contiguous: 2025-05-07T20:32:16.7830222Z x0 = x0.contiguous() 2025-05-07T20:32:16.7830604Z x1 = x1.contiguous() 2025-05-07T20:32:16.7830858Z 2025-05-07T20:32:16.7831061Z if scale_ub is not None: 2025-05-07T20:32:16.7831352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7831711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7832036Z ) 2025-05-07T20:32:16.7832246Z else: 2025-05-07T20:32:16.7832472Z scale_ub_tensor = None 2025-05-07T20:32:16.7832733Z 2025-05-07T20:32:16.7832978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7833308Z op = silu_mul_quant 2025-05-07T20:32:16.7833574Z if compiled: 2025-05-07T20:32:16.7833830Z op = torch.compile(op) 2025-05-07T20:32:16.7834140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7834428Z 2025-05-07T20:32:16.7834625Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.7834801Z 2025-05-07T20:32:16.7834906Z moe/activation_test.py:117: 2025-05-07T20:32:16.7835221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7835559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.7835854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7836569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.7837276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.7837822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7838519Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7839197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7839735Z kernel = self.compile( 2025-05-07T20:32:16.7840756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7841446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7841865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7842100Z 2025-05-07T20:32:16.7842311Z self = 2025-05-07T20:32:16.7843402Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.7844781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec040>} 2025-05-07T20:32:16.7846133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.7847169Z context = 2025-05-07T20:32:16.7847461Z 2025-05-07T20:32:16.7847633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.7848234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.7848712Z module_map=module_map) 2025-05-07T20:32:16.7849084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.7849450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.7849723Z E ^ 2025-05-07T20:32:16.7850197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.7850657Z 2025-05-07T20:32:16.7851076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.7851688Z 2025-05-07T20:32:16.7851798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.7852227Z self=, 2025-05-07T20:32:16.7852634Z T=128, 2025-05-07T20:32:16.7852839Z D=5120, 2025-05-07T20:32:16.7853054Z scale_ub=None, 2025-05-07T20:32:16.7853281Z contiguous=True, 2025-05-07T20:32:16.7853519Z compiled=False, 2025-05-07T20:32:16.7853742Z ) 2025-05-07T20:32:16.7854076Z self = 2025-05-07T20:32:16.7854574Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.7854853Z 2025-05-07T20:32:16.7854936Z @given( 2025-05-07T20:32:16.7855185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7855507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7855831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7856186Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7856521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7856819Z ) 2025-05-07T20:32:16.7857180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7857635Z def test_silu_mul_quant( 2025-05-07T20:32:16.7857886Z self, 2025-05-07T20:32:16.7858091Z T: int, 2025-05-07T20:32:16.7858298Z D: int, 2025-05-07T20:32:16.7858526Z scale_ub: Optional[float], 2025-05-07T20:32:16.7858810Z contiguous: bool, 2025-05-07T20:32:16.7859064Z compiled: bool, 2025-05-07T20:32:16.7859292Z ) -> None: 2025-05-07T20:32:16.7859520Z torch.manual_seed(2025) 2025-05-07T20:32:16.7859777Z 2025-05-07T20:32:16.7860055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7860414Z 2025-05-07T20:32:16.7860621Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7861004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7861411Z x = x_sign * x_clamp 2025-05-07T20:32:16.7861666Z x0 = x[:, :D] 2025-05-07T20:32:16.7861888Z x1 = x[:, D:] 2025-05-07T20:32:16.7862113Z 2025-05-07T20:32:16.7862314Z if contiguous: 2025-05-07T20:32:16.7862553Z x0 = x0.contiguous() 2025-05-07T20:32:16.7862831Z x1 = x1.contiguous() 2025-05-07T20:32:16.7863088Z 2025-05-07T20:32:16.7863295Z if scale_ub is not None: 2025-05-07T20:32:16.7863576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7863926Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7864248Z ) 2025-05-07T20:32:16.7864448Z else: 2025-05-07T20:32:16.7864677Z scale_ub_tensor = None 2025-05-07T20:32:16.7864942Z 2025-05-07T20:32:16.7865182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7865513Z op = silu_mul_quant 2025-05-07T20:32:16.7865791Z if compiled: 2025-05-07T20:32:16.7866048Z op = torch.compile(op) 2025-05-07T20:32:16.7866363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7866653Z 2025-05-07T20:32:16.7866853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.7867030Z 2025-05-07T20:32:16.7867134Z moe/activation_test.py:117: 2025-05-07T20:32:16.7867500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7867848Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.7868140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7868843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.7869542Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.7870088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7870786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7871513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7872062Z kernel = self.compile( 2025-05-07T20:32:16.7872618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7873287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7873697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7873932Z 2025-05-07T20:32:16.7874152Z self = 2025-05-07T20:32:16.7875238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.7876622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec9d0>} 2025-05-07T20:32:16.7877965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.7878991Z context = 2025-05-07T20:32:16.7879288Z 2025-05-07T20:32:16.7879459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.7879995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.7880470Z module_map=module_map) 2025-05-07T20:32:16.7880848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.7881290Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.7881756Z E ^ 2025-05-07T20:32:16.7882315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.7882803Z 2025-05-07T20:32:16.7883424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.7891538Z 2025-05-07T20:32:16.7891679Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.7892121Z self=, 2025-05-07T20:32:16.7892545Z T=128, 2025-05-07T20:32:16.7892748Z D=7168, 2025-05-07T20:32:16.7892947Z scale_ub=None, 2025-05-07T20:32:16.7893177Z contiguous=True, 2025-05-07T20:32:16.7893417Z compiled=False, 2025-05-07T20:32:16.7893631Z ) 2025-05-07T20:32:16.8794610Z self = 2025-05-07T20:32:16.8795184Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.8795495Z 2025-05-07T20:32:16.8795591Z @given( 2025-05-07T20:32:16.8795838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8796176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8796505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8797143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8797485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8797796Z ) 2025-05-07T20:32:16.8798171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8798626Z def test_silu_mul_quant( 2025-05-07T20:32:16.8798896Z self, 2025-05-07T20:32:16.8799116Z T: int, 2025-05-07T20:32:16.8799325Z D: int, 2025-05-07T20:32:16.8799568Z scale_ub: Optional[float], 2025-05-07T20:32:16.8799862Z contiguous: bool, 2025-05-07T20:32:16.8800113Z compiled: bool, 2025-05-07T20:32:16.8800361Z ) -> None: 2025-05-07T20:32:16.8800702Z torch.manual_seed(2025) 2025-05-07T20:32:16.8800955Z 2025-05-07T20:32:16.8801255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8801622Z 2025-05-07T20:32:16.8801840Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8802146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8802487Z x = x_sign * x_clamp 2025-05-07T20:32:16.8802755Z x0 = x[:, :D] 2025-05-07T20:32:16.8802983Z x1 = x[:, D:] 2025-05-07T20:32:16.8803209Z 2025-05-07T20:32:16.8803417Z if contiguous: 2025-05-07T20:32:16.8803660Z x0 = x0.contiguous() 2025-05-07T20:32:16.8803943Z x1 = x1.contiguous() 2025-05-07T20:32:16.8804203Z 2025-05-07T20:32:16.8804406Z if scale_ub is not None: 2025-05-07T20:32:16.8804706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8805063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8805390Z ) 2025-05-07T20:32:16.8805607Z else: 2025-05-07T20:32:16.8805836Z scale_ub_tensor = None 2025-05-07T20:32:16.8806098Z 2025-05-07T20:32:16.8806349Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8806689Z op = silu_mul_quant 2025-05-07T20:32:16.8806964Z if compiled: 2025-05-07T20:32:16.8807226Z op = torch.compile(op) 2025-05-07T20:32:16.8807549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8807849Z 2025-05-07T20:32:16.8808055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8808241Z 2025-05-07T20:32:16.8808348Z moe/activation_test.py:117: 2025-05-07T20:32:16.8808672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8809017Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8809328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8810191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8810913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8811468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8812179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8812874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8813423Z kernel = self.compile( 2025-05-07T20:32:16.8813988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8814662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8815081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8815321Z 2025-05-07T20:32:16.8815545Z self = 2025-05-07T20:32:16.8816643Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8818093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e94f1430>} 2025-05-07T20:32:16.8819451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8820497Z context = 2025-05-07T20:32:16.8820805Z 2025-05-07T20:32:16.8820980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8821623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8822165Z module_map=module_map) 2025-05-07T20:32:16.8822545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8822942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8823259Z E ^ 2025-05-07T20:32:16.8823738Z E ValueError("type fp8e4nv not supported in this architecture. 
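Every traceback funnels through the same launch path: subscripting a @triton.jit function with a grid, as activation.py does with _fbgemm_silu_mul_quant[grid](...), returns a launcher whose first call compiles the kernel, which is where make_ir raises here. The pattern itself, shown with a toy kernel rather than the FBGEMM one:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _toy_double(x_ptr, n_elements, BLOCK: tl.constexpr):
        # Compilation happens on the first launch, inside kernel[grid](...).
        offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2, mask=mask)

    x = torch.ones(1024, device="cuda")
    grid = (triton.cdiv(x.numel(), 256),)
    _toy_double[grid](x, x.numel(), BLOCK=256)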
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8824210Z 2025-05-07T20:32:16.8824632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8825159Z 2025-05-07T20:32:16.8825270Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8825704Z self=, 2025-05-07T20:32:16.8826118Z T=2048, 2025-05-07T20:32:16.8826327Z D=7168, 2025-05-07T20:32:16.8826536Z scale_ub=1200.0, 2025-05-07T20:32:16.8826780Z contiguous=True, 2025-05-07T20:32:16.8827026Z compiled=False, 2025-05-07T20:32:16.8827256Z ) 2025-05-07T20:32:16.8827588Z self = 2025-05-07T20:32:16.8828103Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.8828397Z 2025-05-07T20:32:16.8828481Z @given( 2025-05-07T20:32:16.8828726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8829050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8829380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8829728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8830068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8830374Z ) 2025-05-07T20:32:16.8830741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8831200Z def test_silu_mul_quant( 2025-05-07T20:32:16.8831585Z self, 2025-05-07T20:32:16.8831802Z T: int, 2025-05-07T20:32:16.8832022Z D: int, 2025-05-07T20:32:16.8832252Z scale_ub: Optional[float], 2025-05-07T20:32:16.8832542Z contiguous: bool, 2025-05-07T20:32:16.8832800Z compiled: bool, 2025-05-07T20:32:16.8833037Z ) -> None: 2025-05-07T20:32:16.8833275Z torch.manual_seed(2025) 2025-05-07T20:32:16.8833535Z 2025-05-07T20:32:16.8833818Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8835876Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.8837750Z 2025-05-07T20:32:16.8837873Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.8838097Z 2025-05-07T20:32:16.8838204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8838680Z self=, 2025-05-07T20:32:16.8839085Z T=1, 2025-05-07T20:32:16.8839282Z D=5120, 2025-05-07T20:32:16.8839487Z scale_ub=1200.0, 2025-05-07T20:32:16.8839717Z contiguous=True, 2025-05-07T20:32:16.8839954Z compiled=False, 2025-05-07T20:32:16.8840456Z ) 2025-05-07T20:32:16.9329157Z self = 2025-05-07T20:32:16.9329712Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.9329980Z 2025-05-07T20:32:16.9330064Z @given( 2025-05-07T20:32:16.9330308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9330833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9331153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9331489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9331830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9332134Z ) 2025-05-07T20:32:16.9332488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9332945Z def test_silu_mul_quant( 2025-05-07T20:32:16.9333200Z self, 2025-05-07T20:32:16.9333402Z T: int, 2025-05-07T20:32:16.9333612Z D: int, 2025-05-07T20:32:16.9333847Z scale_ub: Optional[float], 2025-05-07T20:32:16.9334123Z contiguous: bool, 2025-05-07T20:32:16.9334376Z compiled: bool, 2025-05-07T20:32:16.9334616Z ) -> None: 2025-05-07T20:32:16.9334837Z torch.manual_seed(2025) 2025-05-07T20:32:16.9335094Z 2025-05-07T20:32:16.9335386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9335746Z 2025-05-07T20:32:16.9335943Z x_sign = torch.sign(x) 2025-05-07T20:32:16.9336247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.9336576Z x = x_sign * x_clamp 2025-05-07T20:32:16.9336829Z x0 = x[:, :D] 2025-05-07T20:32:16.9337064Z x1 = x[:, D:] 2025-05-07T20:32:16.9337285Z 2025-05-07T20:32:16.9337478Z if contiguous: 2025-05-07T20:32:16.9337723Z x0 = x0.contiguous() 2025-05-07T20:32:16.9337996Z x1 = x1.contiguous() 2025-05-07T20:32:16.9338245Z 2025-05-07T20:32:16.9338450Z if scale_ub is not None: 2025-05-07T20:32:16.9338740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.9339083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.9339410Z ) 2025-05-07T20:32:16.9339621Z else: 2025-05-07T20:32:16.9339841Z scale_ub_tensor = None 2025-05-07T20:32:16.9340514Z 2025-05-07T20:32:16.9340769Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.9341204Z op = silu_mul_quant 2025-05-07T20:32:16.9341470Z if compiled: 2025-05-07T20:32:16.9341737Z op = torch.compile(op) 2025-05-07T20:32:16.9342051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.9342338Z 2025-05-07T20:32:16.9342548Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.9342721Z 2025-05-07T20:32:16.9342835Z moe/activation_test.py:117: 2025-05-07T20:32:16.9343139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.9343490Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.9343788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.9344483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.9345189Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.9345747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.9346440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.9347112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.9347737Z kernel = self.compile( 2025-05-07T20:32:16.9348292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.9348959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.9349363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.9349606Z 2025-05-07T20:32:16.9349820Z self = 2025-05-07T20:32:16.9350911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.9352367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9412160>} 2025-05-07T20:32:16.9353789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.9354815Z context = 2025-05-07T20:32:16.9355111Z 2025-05-07T20:32:16.9355293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.9355822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.9356318Z module_map=module_map) 2025-05-07T20:32:16.9356704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.9357073Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.9357342Z E ^ 2025-05-07T20:32:16.9357818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.9358273Z 2025-05-07T20:32:16.9358700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.9359226Z 2025-05-07T20:32:16.9359340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9359759Z self=, 2025-05-07T20:32:16.9360174Z T=2048, 2025-05-07T20:32:16.9360379Z D=5120, 2025-05-07T20:32:16.9360580Z scale_ub=None, 2025-05-07T20:32:16.9360808Z contiguous=True, 2025-05-07T20:32:16.9361046Z compiled=False, 2025-05-07T20:32:16.9361268Z ) 2025-05-07T20:32:16.9361720Z self = 2025-05-07T20:32:16.9362230Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.9362505Z 2025-05-07T20:32:16.9362588Z @given( 2025-05-07T20:32:16.9362838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9363169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9363491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9363827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9364169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9364468Z ) 2025-05-07T20:32:16.9364825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9365281Z def test_silu_mul_quant( 2025-05-07T20:32:16.9365539Z self, 2025-05-07T20:32:16.9365739Z T: int, 2025-05-07T20:32:16.9365949Z D: int, 2025-05-07T20:32:16.9366192Z scale_ub: Optional[float], 2025-05-07T20:32:16.9366471Z contiguous: bool, 2025-05-07T20:32:16.9366729Z compiled: bool, 2025-05-07T20:32:16.9366965Z ) -> None: 2025-05-07T20:32:16.9367188Z torch.manual_seed(2025) 2025-05-07T20:32:16.9367445Z 2025-05-07T20:32:16.9367729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9368144Z 2025-05-07T20:32:16.9368345Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.9370292Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
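With verbosity=Verbosity.verbose and deadline=None, Hypothesis keeps drawing new examples after each failure, which is why this log interleaves so many parameter sets. To replay one failing draw deterministically while debugging, the arguments can be pinned with hypothesis.example next to the existing @given, e.g. the T=2048, D=5120 case above (stand-in test body shown):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # explicit examples run before random draws
    @settings(deadline=None)
    def test_replay(T: int, D: int) -> None:
        assert T * D > 0  # stand-in for the real silu_mul_quant checks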
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.9372208Z 2025-05-07T20:32:16.9372332Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.9372552Z 2025-05-07T20:32:16.9372667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9373088Z self=, 2025-05-07T20:32:16.9373509Z T=16384, 2025-05-07T20:32:16.9373716Z D=5120, 2025-05-07T20:32:16.9373921Z scale_ub=None, 2025-05-07T20:32:16.9374149Z contiguous=True, 2025-05-07T20:32:16.9374388Z compiled=False, 2025-05-07T20:32:16.9374606Z ) 2025-05-07T20:32:16.9374935Z self = 2025-05-07T20:32:16.9375445Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.9375723Z 2025-05-07T20:32:16.9375812Z @given( 2025-05-07T20:32:16.9376051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9376390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9376711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9377057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9377407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9377714Z ) 2025-05-07T20:32:16.9378080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9378530Z def test_silu_mul_quant( 2025-05-07T20:32:16.9378786Z self, 2025-05-07T20:32:16.9378999Z T: int, 2025-05-07T20:32:16.9379205Z D: int, 2025-05-07T20:32:16.9379439Z scale_ub: Optional[float], 2025-05-07T20:32:16.9379724Z contiguous: bool, 2025-05-07T20:32:16.9379970Z compiled: bool, 2025-05-07T20:32:16.9380206Z ) -> None: 2025-05-07T20:32:16.9380433Z torch.manual_seed(2025) 2025-05-07T20:32:16.9380682Z 2025-05-07T20:32:16.9380965Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9383271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.9385154Z 2025-05-07T20:32:16.9385280Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.9385500Z 2025-05-07T20:32:16.9385616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9386033Z self=, 2025-05-07T20:32:16.9386445Z T=4096, 2025-05-07T20:32:16.9386643Z D=5120, 2025-05-07T20:32:16.9386850Z scale_ub=None, 2025-05-07T20:32:16.9387079Z contiguous=True, 2025-05-07T20:32:16.9387316Z compiled=False, 2025-05-07T20:32:16.9387528Z ) 2025-05-07T20:32:17.0425111Z self = 2025-05-07T20:32:17.0425775Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.0426267Z 2025-05-07T20:32:17.0426353Z @given( 2025-05-07T20:32:17.0426601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0426922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0427247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0427595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0427944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0428236Z ) 2025-05-07T20:32:17.0428598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0429060Z def test_silu_mul_quant( 2025-05-07T20:32:17.0429427Z self, 2025-05-07T20:32:17.0429643Z T: int, 2025-05-07T20:32:17.0429858Z D: int, 2025-05-07T20:32:17.0430086Z scale_ub: Optional[float], 2025-05-07T20:32:17.0430379Z contiguous: bool, 2025-05-07T20:32:17.0430635Z compiled: bool, 2025-05-07T20:32:17.0430875Z ) -> None: 2025-05-07T20:32:17.0431106Z torch.manual_seed(2025) 2025-05-07T20:32:17.0431367Z 2025-05-07T20:32:17.0431647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0433720Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0435603Z 2025-05-07T20:32:17.0435729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0435955Z 2025-05-07T20:32:17.0436063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0436493Z self=, 2025-05-07T20:32:17.0436900Z T=2048, 2025-05-07T20:32:17.0437102Z D=5120, 2025-05-07T20:32:17.0437311Z scale_ub=None, 2025-05-07T20:32:17.0437536Z contiguous=False, 2025-05-07T20:32:17.0437782Z compiled=False, 2025-05-07T20:32:17.0438002Z ) 2025-05-07T20:32:17.0438323Z self = 2025-05-07T20:32:17.0438831Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.0439118Z 2025-05-07T20:32:17.0439202Z @given( 2025-05-07T20:32:17.0439579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0439905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0440513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0440858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0441195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0441497Z ) 2025-05-07T20:32:17.0441865Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0442317Z def test_silu_mul_quant( 2025-05-07T20:32:17.0442578Z self, 2025-05-07T20:32:17.0442788Z T: int, 2025-05-07T20:32:17.0443000Z D: int, 2025-05-07T20:32:17.0443227Z scale_ub: Optional[float], 2025-05-07T20:32:17.0443511Z contiguous: bool, 2025-05-07T20:32:17.0443763Z compiled: bool, 2025-05-07T20:32:17.0443993Z ) -> None: 2025-05-07T20:32:17.0444222Z torch.manual_seed(2025) 2025-05-07T20:32:17.0444477Z 2025-05-07T20:32:17.0444767Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0446784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
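Note that each of these examples dies on its very first CUDA allocation while the device already reports 21.73 GiB allocated by PyTorch, so the 40-320 MiB requests themselves are not the problem; memory held over from earlier examples or tests is. Two mitigations, sketched under the assumption that the test process owns its environment: the allocator option is the one named in the error message above, and empty_cache() only releases cached blocks that no live tensor references, so lingering example tensors must still be dropped first.

    import os

    # Takes effect only if set before the CUDA caching allocator initializes,
    # i.e. before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_gpu_memory() -> None:
        # Return cached, unreferenced blocks to the driver, e.g. from a
        # setUp()/tearDown() hook between Hypothesis examples.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()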
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0448737Z 2025-05-07T20:32:17.0448862Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0449088Z 2025-05-07T20:32:17.0449195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0449621Z self=, 2025-05-07T20:32:17.0450099Z T=4096, 2025-05-07T20:32:17.0450303Z D=7168, 2025-05-07T20:32:17.0450510Z scale_ub=None, 2025-05-07T20:32:17.0450735Z contiguous=True, 2025-05-07T20:32:17.0450975Z compiled=True, 2025-05-07T20:32:17.0451199Z ) 2025-05-07T20:32:17.0451528Z self = 2025-05-07T20:32:17.0452034Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.0452311Z 2025-05-07T20:32:17.0452407Z @given( 2025-05-07T20:32:17.0452662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0453031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0453357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0453702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0454042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0454346Z ) 2025-05-07T20:32:17.0454716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0455170Z def test_silu_mul_quant( 2025-05-07T20:32:17.0455432Z self, 2025-05-07T20:32:17.0455641Z T: int, 2025-05-07T20:32:17.0455849Z D: int, 2025-05-07T20:32:17.0456084Z scale_ub: Optional[float], 2025-05-07T20:32:17.0456371Z contiguous: bool, 2025-05-07T20:32:17.0456623Z compiled: bool, 2025-05-07T20:32:17.0456860Z ) -> None: 2025-05-07T20:32:17.0457089Z torch.manual_seed(2025) 2025-05-07T20:32:17.0457340Z 2025-05-07T20:32:17.0457623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0459765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0461744Z 2025-05-07T20:32:17.0461870Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0462092Z 2025-05-07T20:32:17.0462208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0462627Z self=, 2025-05-07T20:32:17.0463046Z T=2048, 2025-05-07T20:32:17.0463249Z D=5120, 2025-05-07T20:32:17.0463452Z scale_ub=1200.0, 2025-05-07T20:32:17.0463690Z contiguous=False, 2025-05-07T20:32:17.0463931Z compiled=False, 2025-05-07T20:32:17.0464144Z ) 2025-05-07T20:32:17.0464474Z self = 2025-05-07T20:32:17.0464985Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.0465270Z 2025-05-07T20:32:17.0465365Z @given( 2025-05-07T20:32:17.0465602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0465934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0466256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0466595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0466996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0467296Z ) 2025-05-07T20:32:17.0467652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0468107Z def test_silu_mul_quant( 2025-05-07T20:32:17.0468378Z self, 2025-05-07T20:32:17.0468589Z T: int, 2025-05-07T20:32:17.0468795Z D: int, 2025-05-07T20:32:17.0469029Z scale_ub: Optional[float], 2025-05-07T20:32:17.0469449Z contiguous: bool, 2025-05-07T20:32:17.0469743Z compiled: bool, 2025-05-07T20:32:17.0478544Z ) -> None: 2025-05-07T20:32:17.0478881Z torch.manual_seed(2025) 2025-05-07T20:32:17.0479138Z 2025-05-07T20:32:17.0479427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0481468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0483334Z 2025-05-07T20:32:17.0483457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0483680Z 2025-05-07T20:32:17.0483794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0484224Z self=, 2025-05-07T20:32:17.0484650Z T=4096, 2025-05-07T20:32:17.0484856Z D=7168, 2025-05-07T20:32:17.0485059Z scale_ub=1200.0, 2025-05-07T20:32:17.0485303Z contiguous=True, 2025-05-07T20:32:17.0485546Z compiled=False, 2025-05-07T20:32:17.0485761Z ) 2025-05-07T20:32:17.0486106Z self = 2025-05-07T20:32:17.0486626Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.0486904Z 2025-05-07T20:32:17.0487000Z @given( 2025-05-07T20:32:17.0487239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0487577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0487904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0488247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0488598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0488906Z ) 2025-05-07T20:32:17.0489350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0489813Z def test_silu_mul_quant( 2025-05-07T20:32:17.0490073Z self, 2025-05-07T20:32:17.0490276Z T: int, 2025-05-07T20:32:17.0490490Z D: int, 2025-05-07T20:32:17.0490730Z scale_ub: Optional[float], 2025-05-07T20:32:17.0491009Z contiguous: bool, 2025-05-07T20:32:17.0491266Z compiled: bool, 2025-05-07T20:32:17.0491505Z ) -> None: 2025-05-07T20:32:17.0491741Z torch.manual_seed(2025) 2025-05-07T20:32:17.0491991Z 2025-05-07T20:32:17.0492278Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0494367Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0496285Z 2025-05-07T20:32:17.0496418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0496639Z 2025-05-07T20:32:17.0496748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0497179Z self=, 2025-05-07T20:32:17.0497603Z T=16384, 2025-05-07T20:32:17.0497816Z D=7168, 2025-05-07T20:32:17.0498015Z scale_ub=None, 2025-05-07T20:32:17.0498248Z contiguous=False, 2025-05-07T20:32:17.0498491Z compiled=True, 2025-05-07T20:32:17.0498701Z ) 2025-05-07T20:32:17.3865608Z self = 2025-05-07T20:32:17.3866223Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.3866848Z 2025-05-07T20:32:17.3866932Z @given( 2025-05-07T20:32:17.3867182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3867503Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3867837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3868198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3868546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3868845Z ) 2025-05-07T20:32:17.3869212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3869670Z def test_silu_mul_quant( 2025-05-07T20:32:17.3869923Z self, 2025-05-07T20:32:17.3870134Z T: int, 2025-05-07T20:32:17.3870350Z D: int, 2025-05-07T20:32:17.3870580Z scale_ub: Optional[float], 2025-05-07T20:32:17.3870871Z contiguous: bool, 2025-05-07T20:32:17.3871133Z compiled: bool, 2025-05-07T20:32:17.3871381Z ) -> None: 2025-05-07T20:32:17.3871611Z torch.manual_seed(2025) 2025-05-07T20:32:17.3871869Z 2025-05-07T20:32:17.3872149Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3874237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3876100Z 2025-05-07T20:32:17.3876225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3876451Z 2025-05-07T20:32:17.3876719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3877152Z self=, 2025-05-07T20:32:17.3877567Z T=4096, 2025-05-07T20:32:17.3877769Z D=7168, 2025-05-07T20:32:17.3877976Z scale_ub=None, 2025-05-07T20:32:17.3878196Z contiguous=True, 2025-05-07T20:32:17.3878436Z compiled=False, 2025-05-07T20:32:17.3878660Z ) 2025-05-07T20:32:17.3878986Z self = 2025-05-07T20:32:17.3879496Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.3879780Z 2025-05-07T20:32:17.3879864Z @given( 2025-05-07T20:32:17.3880104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3880425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3880749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3881094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3881441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3881739Z ) 2025-05-07T20:32:17.3882101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3882547Z def test_silu_mul_quant( 2025-05-07T20:32:17.3882800Z self, 2025-05-07T20:32:17.3883008Z T: int, 2025-05-07T20:32:17.3883294Z D: int, 2025-05-07T20:32:17.3883524Z scale_ub: Optional[float], 2025-05-07T20:32:17.3883812Z contiguous: bool, 2025-05-07T20:32:17.3884067Z compiled: bool, 2025-05-07T20:32:17.3884295Z ) -> None: 2025-05-07T20:32:17.3884522Z torch.manual_seed(2025) 2025-05-07T20:32:17.3884777Z 2025-05-07T20:32:17.3885049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3887079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
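The "Tried to allocate" sizes match the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element, so each request is T * 2D * 2 bytes. A quick check against the messages above:

    # bfloat16 is 2 bytes/element; x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def expected_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / (1024 ** 2)

    assert expected_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
    assert expected_mib(4096, 5120) == 80.0    # 80.00 MiB
    assert expected_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert expected_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert expected_mib(16384, 7168) == 448.0  # 448.00 MiB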
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3889000Z 2025-05-07T20:32:17.3889123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3889347Z 2025-05-07T20:32:17.3889454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3889879Z self=, 2025-05-07T20:32:17.3890283Z T=16384, 2025-05-07T20:32:17.3890495Z D=7168, 2025-05-07T20:32:17.3890697Z scale_ub=None, 2025-05-07T20:32:17.3890915Z contiguous=True, 2025-05-07T20:32:17.3891157Z compiled=False, 2025-05-07T20:32:17.3891374Z ) 2025-05-07T20:32:17.3891697Z self = 2025-05-07T20:32:17.3892203Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.3892476Z 2025-05-07T20:32:17.3892563Z @given( 2025-05-07T20:32:17.3892792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3893128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3893484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3893823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3894156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3894451Z ) 2025-05-07T20:32:17.3894810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3895252Z def test_silu_mul_quant( 2025-05-07T20:32:17.3895502Z self, 2025-05-07T20:32:17.3895710Z T: int, 2025-05-07T20:32:17.3895910Z D: int, 2025-05-07T20:32:17.3896138Z scale_ub: Optional[float], 2025-05-07T20:32:17.3896502Z contiguous: bool, 2025-05-07T20:32:17.3896747Z compiled: bool, 2025-05-07T20:32:17.3896981Z ) -> None: 2025-05-07T20:32:17.3897208Z torch.manual_seed(2025) 2025-05-07T20:32:17.3897454Z 2025-05-07T20:32:17.3897733Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3899762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3901707Z 2025-05-07T20:32:17.3901830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3902056Z 2025-05-07T20:32:17.3902170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3902586Z self=, 2025-05-07T20:32:17.3902995Z T=16384, 2025-05-07T20:32:17.3903200Z D=7168, 2025-05-07T20:32:17.3903446Z scale_ub=1200.0, 2025-05-07T20:32:17.3903682Z contiguous=True, 2025-05-07T20:32:17.3903911Z compiled=False, 2025-05-07T20:32:17.3904118Z ) 2025-05-07T20:32:17.3904445Z self = 2025-05-07T20:32:17.3904951Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.3905228Z 2025-05-07T20:32:17.3905318Z @given( 2025-05-07T20:32:17.3905548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3905872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3906190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3906573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3906912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3907211Z ) 2025-05-07T20:32:17.3907563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3908016Z def test_silu_mul_quant( 2025-05-07T20:32:17.3908275Z self, 2025-05-07T20:32:17.3908480Z T: int, 2025-05-07T20:32:17.3908680Z D: int, 2025-05-07T20:32:17.3908912Z scale_ub: Optional[float], 2025-05-07T20:32:17.3909195Z contiguous: bool, 2025-05-07T20:32:17.3909435Z compiled: bool, 2025-05-07T20:32:17.3909665Z ) -> None: 2025-05-07T20:32:17.3909894Z torch.manual_seed(2025) 2025-05-07T20:32:17.3910141Z 2025-05-07T20:32:17.3910419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3912493Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3914331Z 2025-05-07T20:32:17.3914460Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3914676Z 2025-05-07T20:32:17.3914788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3915205Z self=, 2025-05-07T20:32:17.3915613Z T=128, 2025-05-07T20:32:17.3915808Z D=5120, 2025-05-07T20:32:17.3916000Z scale_ub=1200.0, 2025-05-07T20:32:17.3916252Z contiguous=False, 2025-05-07T20:32:17.3916485Z compiled=False, 2025-05-07T20:32:17.3916710Z ) 2025-05-07T20:32:17.5553035Z self = 2025-05-07T20:32:17.5553598Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5553877Z 2025-05-07T20:32:17.5553964Z @given( 2025-05-07T20:32:17.5554208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5554542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5554864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5555203Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5555546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5555854Z ) 2025-05-07T20:32:17.5556205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5556661Z def test_silu_mul_quant( 2025-05-07T20:32:17.5556920Z self, 2025-05-07T20:32:17.5557121Z T: int, 2025-05-07T20:32:17.5557337Z D: int, 2025-05-07T20:32:17.5557586Z scale_ub: Optional[float], 2025-05-07T20:32:17.5557867Z contiguous: bool, 2025-05-07T20:32:17.5558117Z compiled: bool, 2025-05-07T20:32:17.5558359Z ) -> None: 2025-05-07T20:32:17.5558579Z torch.manual_seed(2025) 2025-05-07T20:32:17.5558832Z 2025-05-07T20:32:17.5559186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5559539Z 2025-05-07T20:32:17.5559736Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5560040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5560364Z x = x_sign * x_clamp 2025-05-07T20:32:17.5560614Z x0 = x[:, :D] 2025-05-07T20:32:17.5560843Z x1 = x[:, D:] 2025-05-07T20:32:17.5561072Z 2025-05-07T20:32:17.5561262Z if contiguous: 2025-05-07T20:32:17.5561509Z x0 = x0.contiguous() 2025-05-07T20:32:17.5561784Z x1 = x1.contiguous() 2025-05-07T20:32:17.5562034Z 2025-05-07T20:32:17.5562327Z if scale_ub is not None: 2025-05-07T20:32:17.5562620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5563001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5563322Z ) 2025-05-07T20:32:17.5563528Z else: 2025-05-07T20:32:17.5563740Z scale_ub_tensor = None 2025-05-07T20:32:17.5564006Z 2025-05-07T20:32:17.5564249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5564576Z op = silu_mul_quant 2025-05-07T20:32:17.5564834Z if compiled: 2025-05-07T20:32:17.5565093Z op = torch.compile(op) 2025-05-07T20:32:17.5565419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5565706Z 2025-05-07T20:32:17.5565901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5566081Z 2025-05-07T20:32:17.5566186Z moe/activation_test.py:117: 2025-05-07T20:32:17.5566492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5566844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5567135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5567837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5568543Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5569086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5569781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5570460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5570994Z kernel = self.compile( 2025-05-07T20:32:17.5571549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5572298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5572717Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5572952Z 2025-05-07T20:32:17.5573163Z self = 2025-05-07T20:32:17.5574245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5575646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9301ca0>} 2025-05-07T20:32:17.5576985Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5578004Z context = 2025-05-07T20:32:17.5578304Z 2025-05-07T20:32:17.5578475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5579006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5579584Z module_map=module_map) 2025-05-07T20:32:17.5579952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5580323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5580591Z E ^ 2025-05-07T20:32:17.5581053Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5581583Z 2025-05-07T20:32:17.5582009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5582528Z 2025-05-07T20:32:17.5582634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5583115Z self=, 2025-05-07T20:32:17.5583519Z T=2048, 2025-05-07T20:32:17.5583716Z D=7168, 2025-05-07T20:32:17.5583921Z scale_ub=None, 2025-05-07T20:32:17.5584142Z contiguous=False, 2025-05-07T20:32:17.5584379Z compiled=False, 2025-05-07T20:32:17.5584603Z ) 2025-05-07T20:32:17.5584922Z self = 2025-05-07T20:32:17.5585427Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.5585709Z 2025-05-07T20:32:17.5585791Z @given( 2025-05-07T20:32:17.5586031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5586348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5586667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5587008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5587338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5587646Z ) 2025-05-07T20:32:17.5588003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5588453Z def test_silu_mul_quant( 2025-05-07T20:32:17.5588701Z self, 2025-05-07T20:32:17.5588904Z T: int, 2025-05-07T20:32:17.5589112Z D: int, 2025-05-07T20:32:17.5589335Z scale_ub: Optional[float], 2025-05-07T20:32:17.5589617Z contiguous: bool, 2025-05-07T20:32:17.5589869Z compiled: bool, 2025-05-07T20:32:17.5590094Z ) -> None: 2025-05-07T20:32:17.5590319Z torch.manual_seed(2025) 2025-05-07T20:32:17.5590574Z 2025-05-07T20:32:17.5590849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5592986Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5594832Z 2025-05-07T20:32:17.5594955Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.5595181Z 2025-05-07T20:32:17.5595286Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5595711Z self=, 2025-05-07T20:32:17.5596113Z T=128, 2025-05-07T20:32:17.5596313Z D=7168, 2025-05-07T20:32:17.5596519Z scale_ub=1200.0, 2025-05-07T20:32:17.5596746Z contiguous=True, 2025-05-07T20:32:17.5596979Z compiled=True, 2025-05-07T20:32:17.5597189Z ) 2025-05-07T20:32:17.6052224Z self = 2025-05-07T20:32:17.6052759Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.6053063Z 2025-05-07T20:32:17.6053144Z @given( 2025-05-07T20:32:17.6053378Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6053696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6054179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6054522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6054860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6055152Z ) 2025-05-07T20:32:17.6055504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6055954Z def test_silu_mul_quant( 2025-05-07T20:32:17.6056202Z self, 2025-05-07T20:32:17.6056398Z T: int, 2025-05-07T20:32:17.6056602Z D: int, 2025-05-07T20:32:17.6056830Z scale_ub: Optional[float], 2025-05-07T20:32:17.6057106Z contiguous: bool, 2025-05-07T20:32:17.6057439Z compiled: bool, 2025-05-07T20:32:17.6057674Z ) -> None: 2025-05-07T20:32:17.6057892Z torch.manual_seed(2025) 2025-05-07T20:32:17.6058143Z 2025-05-07T20:32:17.6058421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6058765Z 2025-05-07T20:32:17.6058971Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6059270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6059591Z x = x_sign * x_clamp 2025-05-07T20:32:17.6059835Z x0 = x[:, :D] 2025-05-07T20:32:17.6060060Z x1 = x[:, D:] 2025-05-07T20:32:17.6060277Z 2025-05-07T20:32:17.6060466Z if contiguous: 2025-05-07T20:32:17.6060707Z x0 = x0.contiguous() 2025-05-07T20:32:17.6060979Z x1 = x1.contiguous() 2025-05-07T20:32:17.6061359Z 2025-05-07T20:32:17.6061567Z if scale_ub is not None: 2025-05-07T20:32:17.6061851Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.6062203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.6062527Z ) 2025-05-07T20:32:17.6062729Z else: 2025-05-07T20:32:17.6062942Z scale_ub_tensor = None 2025-05-07T20:32:17.6063206Z 2025-05-07T20:32:17.6063447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.6063773Z op = silu_mul_quant 2025-05-07T20:32:17.6064035Z if compiled: 2025-05-07T20:32:17.6064294Z op = torch.compile(op) 2025-05-07T20:32:17.6064594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.6064880Z 2025-05-07T20:32:17.6065081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.6065249Z 2025-05-07T20:32:17.6065358Z moe/activation_test.py:117: 2025-05-07T20:32:17.6065653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.6065994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.6066283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.6066982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.6067563Z return fn(*args, **kwargs) 2025-05-07T20:32:17.6068223Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.6068917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.6069461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.6070150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.6070819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.6071350Z kernel = self.compile( 2025-05-07T20:32:17.6071900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.6072577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.6073032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.6073264Z 2025-05-07T20:32:17.6073477Z self = 2025-05-07T20:32:17.6074604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.6075981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e95ae280>} 2025-05-07T20:32:17.6077325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.6078387Z context = 2025-05-07T20:32:17.6078679Z 2025-05-07T20:32:17.6078851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.6079379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.6079851Z module_map=module_map) 2025-05-07T20:32:17.6080218Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.6080580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.6080846Z E ^ 2025-05-07T20:32:17.6081335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.6081791Z 2025-05-07T20:32:17.6082207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.6082735Z 2025-05-07T20:32:17.6082852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6083278Z self=, 2025-05-07T20:32:17.6083689Z T=128, 2025-05-07T20:32:17.6083959Z D=7168, 2025-05-07T20:32:17.6084265Z scale_ub=1200.0, 2025-05-07T20:32:17.6084765Z contiguous=True, 2025-05-07T20:32:17.6085105Z compiled=False, 2025-05-07T20:32:17.6093967Z ) 2025-05-07T20:32:17.6094325Z self = 2025-05-07T20:32:17.6094830Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.6095115Z 2025-05-07T20:32:17.6095198Z @given( 2025-05-07T20:32:17.6095446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6095774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6096089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6096433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6096911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6097212Z ) 2025-05-07T20:32:17.6097582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6098039Z def test_silu_mul_quant( 2025-05-07T20:32:17.6098288Z self, 2025-05-07T20:32:17.6098498Z T: int, 2025-05-07T20:32:17.6098708Z D: int, 2025-05-07T20:32:17.6098932Z scale_ub: Optional[float], 2025-05-07T20:32:17.6099219Z contiguous: bool, 2025-05-07T20:32:17.6099473Z compiled: bool, 2025-05-07T20:32:17.6099707Z ) -> None: 2025-05-07T20:32:17.6099930Z torch.manual_seed(2025) 2025-05-07T20:32:17.6100186Z 2025-05-07T20:32:17.6100470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6100817Z 2025-05-07T20:32:17.6101019Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6101407Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6103486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.6105412Z 2025-05-07T20:32:17.6105536Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.6105761Z 2025-05-07T20:32:17.6105868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6106292Z self=, 2025-05-07T20:32:17.6106705Z T=128, 2025-05-07T20:32:17.6106896Z D=5120, 2025-05-07T20:32:17.6107099Z scale_ub=1200.0, 2025-05-07T20:32:17.6107387Z contiguous=True, 2025-05-07T20:32:17.6107615Z compiled=True, 2025-05-07T20:32:17.6107833Z ) 2025-05-07T20:32:17.6108163Z self = 2025-05-07T20:32:17.6108663Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.6108944Z 2025-05-07T20:32:17.6109028Z @given( 2025-05-07T20:32:17.6109269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6109589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6109910Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6110253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6110598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6110892Z ) 2025-05-07T20:32:17.6111254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6111705Z def test_silu_mul_quant( 2025-05-07T20:32:17.6111949Z self, 2025-05-07T20:32:17.6112153Z T: int, 2025-05-07T20:32:17.6112365Z D: int, 2025-05-07T20:32:17.6112590Z scale_ub: Optional[float], 2025-05-07T20:32:17.6112874Z contiguous: bool, 2025-05-07T20:32:17.6113129Z compiled: bool, 2025-05-07T20:32:17.6113358Z ) -> None: 2025-05-07T20:32:17.6113592Z torch.manual_seed(2025) 2025-05-07T20:32:17.6113850Z 2025-05-07T20:32:17.6114128Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6114486Z 2025-05-07T20:32:17.6114691Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6114978Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6117073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
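For orientation, the operation under test (its reference appears in the ref_fn body further down in this log) is a SiLU-gated multiply, silu(x0) * x1 with silu(v) = v * sigmoid(v), followed by row-wise fp8 quantization via triton_quantize_fp8_row. A float32 sketch of just the activation half, mirroring the reference:

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # y = silu(x0) * x1, computed in float32 as the test's ref_fn does.
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        return x0 * torch.sigmoid(x0) * x1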
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.6118959Z 2025-05-07T20:32:17.6119084Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.6119310Z 2025-05-07T20:32:17.6119418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6119843Z self=, 2025-05-07T20:32:17.6120247Z T=128, 2025-05-07T20:32:17.6120447Z D=7168, 2025-05-07T20:32:17.6120649Z scale_ub=None, 2025-05-07T20:32:17.6120868Z contiguous=True, 2025-05-07T20:32:17.6121104Z compiled=True, 2025-05-07T20:32:17.6121321Z ) 2025-05-07T20:32:17.8643555Z self = 2025-05-07T20:32:17.8644132Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8644421Z 2025-05-07T20:32:17.8644511Z @given( 2025-05-07T20:32:17.8644759Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8645084Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8645406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8646057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8646411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8646702Z ) 2025-05-07T20:32:17.8647066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8647529Z def test_silu_mul_quant( 2025-05-07T20:32:17.8647780Z self, 2025-05-07T20:32:17.8647990Z T: int, 2025-05-07T20:32:17.8648200Z D: int, 2025-05-07T20:32:17.8648424Z scale_ub: Optional[float], 2025-05-07T20:32:17.8648711Z contiguous: bool, 2025-05-07T20:32:17.8648964Z compiled: bool, 2025-05-07T20:32:17.8649304Z ) -> None: 2025-05-07T20:32:17.8649535Z torch.manual_seed(2025) 2025-05-07T20:32:17.8649792Z 2025-05-07T20:32:17.8650070Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8652119Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8654066Z 2025-05-07T20:32:17.8654191Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.8654418Z 2025-05-07T20:32:17.8682150Z FAILED 2025-05-07T20:32:17.8682346Z 2025-05-07T20:32:17.8682555Z =================================== FAILURES =================================== 2025-05-07T20:32:17.8683001Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:17.8683456Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:17.8684219Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:17.8684935Z | yield 2025-05-07T20:32:17.8685433Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:17.8686051Z | self._callTestMethod(testMethod) 2025-05-07T20:32:17.8686931Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:17.8687858Z | method() 2025-05-07T20:32:17.8688828Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:17.8689729Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8690448Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:17.8691191Z | raise the_error_hypothesis_found 2025-05-07T20:32:17.8691773Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:17.8692345Z +-+---------------- 1 ---------------- 2025-05-07T20:32:17.8692680Z | Traceback (most recent call last): 2025-05-07T20:32:17.8693490Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8694391Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8696747Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
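Each numbered failure in the summary below ends with a Hypothesis reproduce hint. Applied to this test it would look roughly like the sketch here, with the version and payload copied verbatim from the first hint; the decorator pins the generated example so the failure replays deterministically:

    from hypothesis import reproduce_failure

    # Temporary, for local debugging only; remove before committing.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...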
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8698818Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8699281Z | self=, 2025-05-07T20:32:17.8699707Z | T=2048, 2025-05-07T20:32:17.8699967Z | D=5120, # or any other generated value 2025-05-07T20:32:17.8700316Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.8700698Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.8701322Z | compiled=False, # or any other generated value 2025-05-07T20:32:17.8701830Z | ) 2025-05-07T20:32:17.8702086Z | 2025-05-07T20:32:17.8702738Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:17.8703547Z +---------------- 2 ---------------- 2025-05-07T20:32:17.8703980Z | Traceback (most recent call last): 2025-05-07T20:32:17.8704932Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8705856Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8708240Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8710465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8710970Z | self=, 2025-05-07T20:32:17.8711463Z | T=128, 2025-05-07T20:32:17.8711785Z | D=7168, 2025-05-07T20:32:17.8712116Z | scale_ub=None, 2025-05-07T20:32:17.8712498Z | contiguous=True, 2025-05-07T20:32:17.8712899Z | compiled=True, 2025-05-07T20:32:17.8713266Z | ) 2025-05-07T20:32:17.8713564Z | 2025-05-07T20:32:17.8714480Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8715635Z +---------------- 3 ---------------- 2025-05-07T20:32:17.8716112Z | Traceback (most recent call last): 2025-05-07T20:32:17.8717220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8718399Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8720936Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8723015Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8723477Z | self=, 2025-05-07T20:32:17.8723901Z | T=128, 2025-05-07T20:32:17.8724116Z | D=5120, 2025-05-07T20:32:17.8724336Z | scale_ub=1200.0, 2025-05-07T20:32:17.8724713Z | contiguous=True, 2025-05-07T20:32:17.8724967Z | compiled=True, 2025-05-07T20:32:17.8725198Z | ) 2025-05-07T20:32:17.8725391Z | 2025-05-07T20:32:17.8726018Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8726639Z +---------------- 4 ---------------- 2025-05-07T20:32:17.8726953Z | Traceback (most recent call last): 2025-05-07T20:32:17.8727687Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:17.8728497Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8729163Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:17.8729879Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8730893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:17.8731715Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8732336Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:17.8733087Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8733850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:17.8734915Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8736065Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:17.8737218Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8738359Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:17.8739406Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8740663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:17.8741612Z | fn() 2025-05-07T20:32:17.8742976Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:17.8743905Z | self.fn.run( 2025-05-07T20:32:17.8744643Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:17.8745419Z | kernel = self.compile( 2025-05-07T20:32:17.8746315Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:17.8747343Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8748332Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.8749425Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8750166Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8750664Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8751064Z | ^ 2025-05-07T20:32:17.8751727Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8752521Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8753155Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:17.8754020Z | self=, 2025-05-07T20:32:17.8754627Z | T=1, # or any other generated value 2025-05-07T20:32:17.8755076Z | D=5120, # or any other generated value 2025-05-07T20:32:17.8755560Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.8756082Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.8756599Z | compiled=True, # or any other generated value 2025-05-07T20:32:17.8757036Z | ) 2025-05-07T20:32:17.8757297Z | 2025-05-07T20:32:17.8758051Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8759039Z +------------------------------------ 2025-05-07T20:32:17.8759565Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:17.8760130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8760725Z self=, 2025-05-07T20:32:17.8761306Z T=1, 2025-05-07T20:32:17.8761574Z D=5120, 2025-05-07T20:32:17.8761849Z scale_ub=None, 2025-05-07T20:32:17.8762162Z contiguous=True, 2025-05-07T20:32:17.8762488Z compiled=True, 2025-05-07T20:32:17.8762787Z ) 2025-05-07T20:32:17.8763254Z self = 2025-05-07T20:32:17.8763941Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8764311Z 2025-05-07T20:32:17.8764427Z @given( 2025-05-07T20:32:17.8764769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8765215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8765667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8766137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8766617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8767023Z ) 2025-05-07T20:32:17.8767512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8768148Z def test_silu_mul_quant( 2025-05-07T20:32:17.8768508Z self, 2025-05-07T20:32:17.8768775Z T: int, 2025-05-07T20:32:17.8769064Z D: int, 2025-05-07T20:32:17.8769378Z scale_ub: Optional[float], 2025-05-07T20:32:17.8769764Z contiguous: bool, 2025-05-07T20:32:17.8770120Z compiled: bool, 2025-05-07T20:32:17.8770457Z ) -> None: 2025-05-07T20:32:17.8770764Z torch.manual_seed(2025) 2025-05-07T20:32:17.8771116Z 2025-05-07T20:32:17.8771629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8772124Z 2025-05-07T20:32:17.8772411Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8772844Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8773286Z x = x_sign * x_clamp 2025-05-07T20:32:17.8773629Z x0 = x[:, :D] 2025-05-07T20:32:17.8773941Z x1 = x[:, D:] 2025-05-07T20:32:17.8774252Z 2025-05-07T20:32:17.8774514Z if contiguous: 2025-05-07T20:32:17.8774854Z x0 = x0.contiguous() 
2025-05-07T20:32:17.8775234Z x1 = x1.contiguous() 2025-05-07T20:32:17.8775581Z 2025-05-07T20:32:17.8775867Z if scale_ub is not None: 2025-05-07T20:32:17.8776270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8776745Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8777182Z ) 2025-05-07T20:32:17.8777465Z else: 2025-05-07T20:32:17.8777775Z scale_ub_tensor = None 2025-05-07T20:32:17.8778149Z 2025-05-07T20:32:17.8778476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8778928Z op = silu_mul_quant 2025-05-07T20:32:17.8779273Z if compiled: 2025-05-07T20:32:17.8779625Z op = torch.compile(op) 2025-05-07T20:32:17.8780102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8780488Z 2025-05-07T20:32:17.8780769Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8781299Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8781723Z 2025-05-07T20:32:17.8782069Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8782553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8783009Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8783460Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8783939Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8784452Z 2025-05-07T20:32:17.8784739Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8784984Z 2025-05-07T20:32:17.8785132Z moe/activation_test.py:126: 2025-05-07T20:32:17.8785559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8786032Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8786500Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8787612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8788674Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8789443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8790403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8791395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8792423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8793490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8794570Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8795602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8796521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8797366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8798104Z fn() 2025-05-07T20:32:17.8798868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8799705Z self.fn.run( 2025-05-07T20:32:17.8800380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8801150Z kernel = self.compile( 2025-05-07T20:32:17.8801920Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8802834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8803388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8803713Z 2025-05-07T20:32:17.8804000Z self = 2025-05-07T20:32:17.8805525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8807491Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7edee7040>} 2025-05-07T20:32:17.8809401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8810901Z context = 2025-05-07T20:32:17.8811324Z 2025-05-07T20:32:17.8811569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8812334Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8813013Z module_map=module_map) 2025-05-07T20:32:17.8813569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8814063Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8814519Z E ^ 2025-05-07T20:32:17.8815187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8815828Z 2025-05-07T20:32:17.8816415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8817177Z 2025-05-07T20:32:17.8817334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8817912Z self=, 2025-05-07T20:32:17.8818480Z T=2048, 2025-05-07T20:32:17.8818747Z D=5120, 2025-05-07T20:32:17.8819023Z scale_ub=1200.0, 2025-05-07T20:32:17.8819340Z contiguous=True, 2025-05-07T20:32:17.8819652Z compiled=False, 2025-05-07T20:32:17.8819945Z ) 2025-05-07T20:32:17.8820394Z self = 2025-05-07T20:32:17.8821227Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.8821626Z 2025-05-07T20:32:17.8821746Z @given( 2025-05-07T20:32:17.8822076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8822516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8822951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8823403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8823863Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8824254Z ) 2025-05-07T20:32:17.8846327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8846977Z def test_silu_mul_quant( 2025-05-07T20:32:17.8847328Z self, 2025-05-07T20:32:17.8847605Z T: int, 2025-05-07T20:32:17.8847887Z D: int, 2025-05-07T20:32:17.8848183Z scale_ub: Optional[float], 2025-05-07T20:32:17.8848558Z contiguous: bool, 2025-05-07T20:32:17.8848879Z compiled: bool, 2025-05-07T20:32:17.8849472Z ) -> None: 2025-05-07T20:32:17.8849785Z torch.manual_seed(2025) 2025-05-07T20:32:17.8850128Z 2025-05-07T20:32:17.8850507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8850969Z 2025-05-07T20:32:17.8851253Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8851679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8852117Z x = x_sign * x_clamp 2025-05-07T20:32:17.8852473Z x0 = x[:, :D] 
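The sign/clamp preamble in the listing above bounds every input magnitude to [0.01, 2.0] while preserving signs, so no row of x is all-zero and the row-wise quantization scale never degenerates. A standalone rendering of that construction (tiny shapes and CPU tensors here, purely for portability):

import torch

T, D = 4, 8  # stand-in sizes; the test samples T and D from much larger sets
x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0, x1 = x[:, :D], x[:, D:]  # mirrors the x[:, :D] / x[:, D:] split above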
2025-05-07T20:32:17.8852794Z x1 = x[:, D:] 2025-05-07T20:32:17.8853087Z 2025-05-07T20:32:17.8853353Z if contiguous: 2025-05-07T20:32:17.8853684Z x0 = x0.contiguous() 2025-05-07T20:32:17.8854034Z x1 = x1.contiguous() 2025-05-07T20:32:17.8854381Z 2025-05-07T20:32:17.8854660Z if scale_ub is not None: 2025-05-07T20:32:17.8855046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8855521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8855961Z ) 2025-05-07T20:32:17.8856214Z else: 2025-05-07T20:32:17.8856514Z scale_ub_tensor = None 2025-05-07T20:32:17.8856890Z 2025-05-07T20:32:17.8857229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8857667Z op = silu_mul_quant 2025-05-07T20:32:17.8858138Z if compiled: 2025-05-07T20:32:17.8858499Z op = torch.compile(op) 2025-05-07T20:32:17.8858922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8859322Z 2025-05-07T20:32:17.8859601Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.8859839Z 2025-05-07T20:32:17.8859980Z moe/activation_test.py:117: 2025-05-07T20:32:17.8860399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8860883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.8861407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8862358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.8863420Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.8864175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8865104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8866021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8866762Z kernel = self.compile( 2025-05-07T20:32:17.8867522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8868393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8868962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8869284Z 2025-05-07T20:32:17.8869582Z self = 2025-05-07T20:32:17.8871052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8873010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7cbe6e5e0>} 2025-05-07T20:32:17.8874884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8876220Z context = 2025-05-07T20:32:17.8876607Z 2025-05-07T20:32:17.8876835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8877661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8878340Z module_map=module_map) 2025-05-07T20:32:17.8878858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8879356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.8879747Z E ^ 2025-05-07T20:32:17.8880384Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8880849Z 2025-05-07T20:32:17.8881289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8881813Z 2025-05-07T20:32:17.8881923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8882353Z self=, 2025-05-07T20:32:17.8882771Z T=2048, 2025-05-07T20:32:17.8882974Z D=5120, 2025-05-07T20:32:17.8883191Z scale_ub=1200.0, 2025-05-07T20:32:17.8883428Z contiguous=True, 2025-05-07T20:32:17.8883660Z compiled=True, 2025-05-07T20:32:17.8883882Z ) 2025-05-07T20:32:17.8884216Z self = 2025-05-07T20:32:17.8884727Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.8885072Z 2025-05-07T20:32:17.8885155Z @given( 2025-05-07T20:32:17.8885398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8885725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8886037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8886370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8886707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8887005Z ) 2025-05-07T20:32:17.8887367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8887823Z def test_silu_mul_quant( 2025-05-07T20:32:17.8888136Z self, 2025-05-07T20:32:17.8888337Z T: int, 2025-05-07T20:32:17.8888544Z D: int, 2025-05-07T20:32:17.8888774Z scale_ub: Optional[float], 2025-05-07T20:32:17.8889049Z contiguous: bool, 2025-05-07T20:32:17.8889298Z compiled: bool, 2025-05-07T20:32:17.8889533Z ) -> None: 2025-05-07T20:32:17.8889753Z torch.manual_seed(2025) 2025-05-07T20:32:17.8890004Z 2025-05-07T20:32:17.8890285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8890630Z 2025-05-07T20:32:17.8890832Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8891135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8891448Z x = x_sign * x_clamp 2025-05-07T20:32:17.8891701Z x0 = x[:, :D] 2025-05-07T20:32:17.8891929Z x1 = x[:, D:] 2025-05-07T20:32:17.8892140Z 2025-05-07T20:32:17.8892339Z if contiguous: 2025-05-07T20:32:17.8892594Z x0 = x0.contiguous() 2025-05-07T20:32:17.8892873Z x1 = x1.contiguous() 2025-05-07T20:32:17.8893123Z 2025-05-07T20:32:17.8893328Z if scale_ub is not None: 2025-05-07T20:32:17.8893612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8893956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8894282Z ) 2025-05-07T20:32:17.8894489Z else: 2025-05-07T20:32:17.8894707Z scale_ub_tensor = None 2025-05-07T20:32:17.8894971Z 2025-05-07T20:32:17.8895215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8895539Z op = silu_mul_quant 2025-05-07T20:32:17.8895809Z if compiled: 2025-05-07T20:32:17.8896053Z op = torch.compile(op) 2025-05-07T20:32:17.8896356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8896638Z 2025-05-07T20:32:17.8896842Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8897208Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8897517Z 2025-05-07T20:32:17.8897755Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8898099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8898400Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8898719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8899083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8899400Z 2025-05-07T20:32:17.8899611Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8899806Z 2025-05-07T20:32:17.8899908Z moe/activation_test.py:126: 2025-05-07T20:32:17.8900212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8900558Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8900890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8901787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8902549Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8903156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8903840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8904604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8905329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8906084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8906828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8907565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8908258Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8908871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8909390Z fn() 2025-05-07T20:32:17.8909910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8910499Z self.fn.run( 2025-05-07T20:32:17.8910976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8911519Z kernel = self.compile( 2025-05-07T20:32:17.8912077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8912740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8913187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8913430Z 2025-05-07T20:32:17.8913640Z self = 2025-05-07T20:32:17.8914729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8916129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7eca155e0>} 2025-05-07T20:32:17.8917472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8918495Z context = 2025-05-07T20:32:17.8918793Z 2025-05-07T20:32:17.8919043Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8919585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8920051Z module_map=module_map) 2025-05-07T20:32:17.8920435Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8920806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8921087Z E ^ 2025-05-07T20:32:17.8921552Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8922010Z 2025-05-07T20:32:17.8922438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8922953Z 2025-05-07T20:32:17.8923064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8923481Z self=, 2025-05-07T20:32:17.8923894Z T=16384, 2025-05-07T20:32:17.8924095Z D=7168, 2025-05-07T20:32:17.8924295Z scale_ub=1200.0, 2025-05-07T20:32:17.8924520Z contiguous=False, 2025-05-07T20:32:17.8924766Z compiled=False, 2025-05-07T20:32:17.8924980Z ) 2025-05-07T20:32:17.8925298Z self = 2025-05-07T20:32:17.8925886Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.8926170Z 2025-05-07T20:32:17.8926259Z @given( 2025-05-07T20:32:17.8926490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8926819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8927133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8927469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8927801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8928096Z ) 2025-05-07T20:32:17.8928462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8928963Z def test_silu_mul_quant( 2025-05-07T20:32:17.8929217Z self, 2025-05-07T20:32:17.8929418Z T: int, 2025-05-07T20:32:17.8929615Z D: int, 2025-05-07T20:32:17.8929839Z scale_ub: Optional[float], 2025-05-07T20:32:17.8930121Z contiguous: bool, 2025-05-07T20:32:17.8930361Z compiled: bool, 2025-05-07T20:32:17.8930592Z ) -> None: 2025-05-07T20:32:17.8930830Z torch.manual_seed(2025) 2025-05-07T20:32:17.8931076Z 2025-05-07T20:32:17.8931355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8931709Z 2025-05-07T20:32:17.8931900Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8932201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8932524Z x = x_sign * x_clamp 2025-05-07T20:32:17.8932770Z x0 = x[:, :D] 2025-05-07T20:32:17.8932999Z x1 = x[:, D:] 2025-05-07T20:32:17.8933235Z 2025-05-07T20:32:17.8933447Z if contiguous: 2025-05-07T20:32:17.8933691Z x0 = x0.contiguous() 2025-05-07T20:32:17.8933957Z x1 = x1.contiguous() 2025-05-07T20:32:17.8934198Z 2025-05-07T20:32:17.8934394Z if scale_ub is not None: 2025-05-07T20:32:17.8934673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8935014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8935332Z ) 2025-05-07T20:32:17.8935525Z else: 2025-05-07T20:32:17.8935742Z scale_ub_tensor = None 2025-05-07T20:32:17.8935993Z 2025-05-07T20:32:17.8936226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8936549Z op = silu_mul_quant 2025-05-07T20:32:17.8936801Z if compiled: 
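ref_fn in the listings above dequantizes as y ≈ y_fp8.float() * y_scale[:, None], so triton_quantize_fp8_row must return a per-row scale of roughly row_max / FP8_MAX and store y / scale in fp8. A pure-PyTorch sketch of that contract follows; FP8_MAX = 448 (the finite max of float8_e4m3fn) and the scale_ub handling are assumptions of this sketch rather than FBGEMM's kernel, and torch.float8_e4m3fn needs PyTorch 2.1+.

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # finite max of float8_e4m3fn; assumed, not read from FBGEMM

def quantize_fp8_row_ref(y: torch.Tensor,
                         scale_ub: Optional[torch.Tensor] = None,
                         ) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).float()           # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap rows, as scale_ub suggests
    scale = row_max.clamp(min=1e-12) / FP8_MAX      # per-row dequant multiplier
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale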
2025-05-07T20:32:17.8937055Z op = torch.compile(op) 2025-05-07T20:32:17.8937359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8937638Z 2025-05-07T20:32:17.8937924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.8938095Z 2025-05-07T20:32:17.8938203Z moe/activation_test.py:117: 2025-05-07T20:32:17.8938499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8938843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.8939140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8939850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.8940833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.8941442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8942129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8942794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8943345Z kernel = self.compile( 2025-05-07T20:32:17.8943894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8944558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8944960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8945297Z 2025-05-07T20:32:17.8945509Z self = 2025-05-07T20:32:17.8946598Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8947972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec7e61f0>} 2025-05-07T20:32:17.8949376Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8950399Z context = 2025-05-07T20:32:17.8950699Z 2025-05-07T20:32:17.8950870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8951400Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8951870Z module_map=module_map) 2025-05-07T20:32:17.8952252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8952619Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.8952913Z E ^ 2025-05-07T20:32:17.8953400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8953860Z 2025-05-07T20:32:17.8954282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8954804Z 2025-05-07T20:32:17.8954917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8955331Z self=, 2025-05-07T20:32:17.8955749Z T=1, 2025-05-07T20:32:17.8955940Z D=7168, 2025-05-07T20:32:17.8956144Z scale_ub=None, 2025-05-07T20:32:17.8956359Z contiguous=True, 2025-05-07T20:32:17.8956596Z compiled=True, 2025-05-07T20:32:17.8956808Z ) 2025-05-07T20:32:17.8957129Z self = 2025-05-07T20:32:17.8957623Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8957884Z 2025-05-07T20:32:17.8957971Z @given( 2025-05-07T20:32:17.8958203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8958524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8959047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8959385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8959723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8960016Z ) 2025-05-07T20:32:17.8960373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8960821Z def test_silu_mul_quant( 2025-05-07T20:32:17.8961071Z self, 2025-05-07T20:32:17.8961273Z T: int, 2025-05-07T20:32:17.8961472Z D: int, 2025-05-07T20:32:17.8961698Z scale_ub: Optional[float], 2025-05-07T20:32:17.8961984Z contiguous: bool, 2025-05-07T20:32:17.8962225Z compiled: bool, 2025-05-07T20:32:17.8962455Z ) -> None: 2025-05-07T20:32:17.8962678Z torch.manual_seed(2025) 2025-05-07T20:32:17.8962923Z 2025-05-07T20:32:17.8963202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8963555Z 2025-05-07T20:32:17.8963755Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8964054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8964375Z x = x_sign * x_clamp 2025-05-07T20:32:17.8964618Z x0 = x[:, :D] 2025-05-07T20:32:17.8964842Z x1 = x[:, D:] 2025-05-07T20:32:17.8965106Z 2025-05-07T20:32:17.8965298Z if contiguous: 2025-05-07T20:32:17.8965532Z x0 = x0.contiguous() 2025-05-07T20:32:17.8965800Z x1 = x1.contiguous() 2025-05-07T20:32:17.8966047Z 2025-05-07T20:32:17.8966242Z if scale_ub is not None: 2025-05-07T20:32:17.8966525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8966868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8967180Z ) 2025-05-07T20:32:17.8967380Z else: 2025-05-07T20:32:17.8967599Z scale_ub_tensor = None 2025-05-07T20:32:17.8967852Z 2025-05-07T20:32:17.8968097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8968472Z op = silu_mul_quant 2025-05-07T20:32:17.8968728Z if compiled: 2025-05-07T20:32:17.8968985Z op = torch.compile(op) 2025-05-07T20:32:17.8969290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8969567Z 2025-05-07T20:32:17.8969770Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8970067Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8970366Z 2025-05-07T20:32:17.8970603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8970947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8971252Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8971575Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8971944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8972264Z 2025-05-07T20:32:17.8972468Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.8972683Z 2025-05-07T20:32:17.8972797Z moe/activation_test.py:126: 2025-05-07T20:32:17.8973139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8973485Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8973818Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8974614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8975372Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8975920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8976609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8977304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8978124Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8978989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8979760Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8980498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8981209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8981820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8982347Z fn() 2025-05-07T20:32:17.8982861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8983488Z self.fn.run( 2025-05-07T20:32:17.8983965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8984513Z kernel = self.compile( 2025-05-07T20:32:17.8985067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8985725Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8986223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8986460Z 2025-05-07T20:32:17.8986681Z self = 2025-05-07T20:32:17.8987761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8989124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ec7e6790>} 2025-05-07T20:32:17.8990520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8991545Z context = 2025-05-07T20:32:17.8991842Z 2025-05-07T20:32:17.8992019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8992544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8993015Z module_map=module_map) 2025-05-07T20:32:17.8993391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8993757Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8994026Z E ^ 2025-05-07T20:32:17.8994499Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8994958Z 2025-05-07T20:32:17.8995380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8995891Z 2025-05-07T20:32:17.8995997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8996427Z self=, 2025-05-07T20:32:17.8996837Z T=4096, 2025-05-07T20:32:17.8997034Z D=5120, 2025-05-07T20:32:17.8997226Z scale_ub=None, 2025-05-07T20:32:17.8997450Z contiguous=False, 2025-05-07T20:32:17.8997689Z compiled=False, 2025-05-07T20:32:17.8997896Z ) 2025-05-07T20:32:17.8998228Z self = 2025-05-07T20:32:17.8998730Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.8999005Z 2025-05-07T20:32:17.8999086Z @given( 2025-05-07T20:32:17.8999439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8999765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9000074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9000409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9000748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9001047Z ) 2025-05-07T20:32:17.9001397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9001845Z def test_silu_mul_quant( 2025-05-07T20:32:17.9002094Z self, 2025-05-07T20:32:17.9002288Z T: int, 2025-05-07T20:32:17.9002496Z D: int, 2025-05-07T20:32:17.9002721Z scale_ub: Optional[float], 2025-05-07T20:32:17.9003043Z contiguous: bool, 2025-05-07T20:32:17.9003293Z compiled: bool, 2025-05-07T20:32:17.9003522Z ) -> None: 2025-05-07T20:32:17.9003738Z torch.manual_seed(2025) 2025-05-07T20:32:17.9003987Z 2025-05-07T20:32:17.9004275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9004618Z 2025-05-07T20:32:17.9004819Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9005123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9005436Z x = x_sign * x_clamp 2025-05-07T20:32:17.9005740Z x0 = x[:, :D] 2025-05-07T20:32:17.9005970Z x1 = x[:, D:] 2025-05-07T20:32:17.9006186Z 2025-05-07T20:32:17.9014489Z if contiguous: 2025-05-07T20:32:17.9014770Z x0 = x0.contiguous() 2025-05-07T20:32:17.9015046Z x1 = x1.contiguous() 2025-05-07T20:32:17.9015301Z 2025-05-07T20:32:17.9015502Z if scale_ub is not None: 2025-05-07T20:32:17.9015789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9016136Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9016451Z ) 2025-05-07T20:32:17.9016659Z else: 2025-05-07T20:32:17.9016893Z scale_ub_tensor = None 2025-05-07T20:32:17.9017236Z 2025-05-07T20:32:17.9017475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9017808Z op = silu_mul_quant 2025-05-07T20:32:17.9018071Z if compiled: 
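The frames at triton/runtime/autotuner.py:186 in the tracebacks above show why the failure surfaces during benchmarking: the autotuner times every pruned config, and each timing call JIT-compiles the kernel, so an unsupported dtype raises CompilationError inside the timing dict-comprehension rather than at import time. A simplified, self-contained analogue of that loop (all names here are illustrative, not Triton's API):

import time
from typing import Callable, Dict, Iterable, TypeVar

Cfg = TypeVar("Cfg")

def _bench(kernel_call: Callable[[], None]) -> float:
    start = time.perf_counter()
    kernel_call()                    # first launch JIT-compiles; may raise
    return time.perf_counter() - start

def autotune(make_call: Callable[[Cfg], Callable[[], None]],
             configs: Iterable[Cfg]) -> Cfg:
    # Mirrors autotuner.py:186: every config is compiled and timed eagerly,
    # so a single CompilationError aborts the whole kernel launch.
    timings: Dict[Cfg, float] = {cfg: _bench(make_call(cfg)) for cfg in configs}
    return min(timings, key=timings.get)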
2025-05-07T20:32:17.9018326Z op = torch.compile(op) 2025-05-07T20:32:17.9018642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9018934Z 2025-05-07T20:32:17.9019128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9019308Z 2025-05-07T20:32:17.9019415Z moe/activation_test.py:117: 2025-05-07T20:32:17.9019728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9020065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9020362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9021159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9021897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9022449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9023148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9023819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9024358Z kernel = self.compile( 2025-05-07T20:32:17.9024915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9025578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9025994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9026227Z 2025-05-07T20:32:17.9026441Z self = 2025-05-07T20:32:17.9027618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9029022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec3e9550>} 2025-05-07T20:32:17.9030384Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9031406Z context = 2025-05-07T20:32:17.9031699Z 2025-05-07T20:32:17.9031869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9032401Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9032924Z module_map=module_map) 2025-05-07T20:32:17.9033317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9033684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9033959Z E ^ 2025-05-07T20:32:17.9034431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9034927Z 2025-05-07T20:32:17.9035345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9035876Z 2025-05-07T20:32:17.9035985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9036420Z self=, 2025-05-07T20:32:17.9036833Z T=4096, 2025-05-07T20:32:17.9037026Z D=7168, 2025-05-07T20:32:17.9037232Z scale_ub=None, 2025-05-07T20:32:17.9037460Z contiguous=False, 2025-05-07T20:32:17.9037693Z compiled=False, 2025-05-07T20:32:17.9037967Z ) 2025-05-07T20:32:17.9038299Z self = 2025-05-07T20:32:17.9038801Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9039084Z 2025-05-07T20:32:17.9039170Z @given( 2025-05-07T20:32:17.9039415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9039734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9040060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9040726Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9041066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9041369Z ) 2025-05-07T20:32:17.9041726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9042186Z def test_silu_mul_quant( 2025-05-07T20:32:17.9042432Z self, 2025-05-07T20:32:17.9042638Z T: int, 2025-05-07T20:32:17.9042850Z D: int, 2025-05-07T20:32:17.9043078Z scale_ub: Optional[float], 2025-05-07T20:32:17.9043363Z contiguous: bool, 2025-05-07T20:32:17.9043610Z compiled: bool, 2025-05-07T20:32:17.9043839Z ) -> None: 2025-05-07T20:32:17.9044061Z torch.manual_seed(2025) 2025-05-07T20:32:17.9044314Z 2025-05-07T20:32:17.9044599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9044950Z 2025-05-07T20:32:17.9045155Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9045448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9045770Z x = x_sign * x_clamp 2025-05-07T20:32:17.9046018Z x0 = x[:, :D] 2025-05-07T20:32:17.9046238Z x1 = x[:, D:] 2025-05-07T20:32:17.9046454Z 2025-05-07T20:32:17.9046650Z if contiguous: 2025-05-07T20:32:17.9046890Z x0 = x0.contiguous() 2025-05-07T20:32:17.9047157Z x1 = x1.contiguous() 2025-05-07T20:32:17.9047413Z 2025-05-07T20:32:17.9047799Z if scale_ub is not None: 2025-05-07T20:32:17.9048081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9048434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9048755Z ) 2025-05-07T20:32:17.9048952Z else: 2025-05-07T20:32:17.9049175Z scale_ub_tensor = None 2025-05-07T20:32:17.9049441Z 2025-05-07T20:32:17.9049677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9050008Z op = silu_mul_quant 2025-05-07T20:32:17.9050271Z if compiled: 2025-05-07T20:32:17.9050525Z op = torch.compile(op) 2025-05-07T20:32:17.9050831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9051117Z 2025-05-07T20:32:17.9051310Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9051491Z 2025-05-07T20:32:17.9051593Z moe/activation_test.py:117: 2025-05-07T20:32:17.9051904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9052258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9052548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9053254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9054025Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9054561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9055259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9055927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9056466Z kernel = self.compile( 2025-05-07T20:32:17.9057015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9057685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9058164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9058397Z 2025-05-07T20:32:17.9058614Z self = 2025-05-07T20:32:17.9059699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9061059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec29b5e0>} 2025-05-07T20:32:17.9062470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9063551Z context = 2025-05-07T20:32:17.9063843Z 2025-05-07T20:32:17.9064022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9064551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9065024Z module_map=module_map) 2025-05-07T20:32:17.9065397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9065757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9066018Z E ^ 2025-05-07T20:32:17.9066486Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9066934Z 2025-05-07T20:32:17.9067354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9067865Z 2025-05-07T20:32:17.9067977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9068483Z self=, 2025-05-07T20:32:17.9068896Z T=128, 2025-05-07T20:32:17.9069093Z D=7168, 2025-05-07T20:32:17.9069285Z scale_ub=None, 2025-05-07T20:32:17.9069510Z contiguous=False, 2025-05-07T20:32:17.9069746Z compiled=True, 2025-05-07T20:32:17.9069951Z ) 2025-05-07T20:32:17.9070277Z self = 2025-05-07T20:32:17.9070774Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9071042Z 2025-05-07T20:32:17.9071127Z @given( 2025-05-07T20:32:17.9071358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9071677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9071993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9072324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9072661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9072968Z ) 2025-05-07T20:32:17.9073324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9073777Z def test_silu_mul_quant( 2025-05-07T20:32:17.9074029Z self, 2025-05-07T20:32:17.9074225Z T: int, 2025-05-07T20:32:17.9074478Z D: int, 2025-05-07T20:32:17.9074705Z scale_ub: Optional[float], 2025-05-07T20:32:17.9074979Z contiguous: bool, 2025-05-07T20:32:17.9075226Z compiled: bool, 2025-05-07T20:32:17.9075455Z ) -> None: 2025-05-07T20:32:17.9075679Z torch.manual_seed(2025) 2025-05-07T20:32:17.9075921Z 2025-05-07T20:32:17.9076198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9076549Z 2025-05-07T20:32:17.9076743Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9077045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9077365Z x = x_sign * x_clamp 2025-05-07T20:32:17.9077658Z x0 = x[:, :D] 2025-05-07T20:32:17.9077881Z x1 = x[:, D:] 2025-05-07T20:32:17.9078096Z 2025-05-07T20:32:17.9078282Z if contiguous: 2025-05-07T20:32:17.9078525Z x0 = x0.contiguous() 2025-05-07T20:32:17.9078793Z x1 = x1.contiguous() 2025-05-07T20:32:17.9079037Z 2025-05-07T20:32:17.9079241Z if scale_ub is not None: 2025-05-07T20:32:17.9079523Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9079860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9080179Z ) 2025-05-07T20:32:17.9080380Z else: 2025-05-07T20:32:17.9080599Z scale_ub_tensor = None 2025-05-07T20:32:17.9080853Z 2025-05-07T20:32:17.9081093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9081417Z op = silu_mul_quant 2025-05-07T20:32:17.9081672Z if compiled: 2025-05-07T20:32:17.9081928Z op = torch.compile(op) 2025-05-07T20:32:17.9082245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9082523Z 2025-05-07T20:32:17.9082721Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9083014Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9083306Z 2025-05-07T20:32:17.9083550Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9083898Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9084192Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9084513Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9084885Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9085199Z 2025-05-07T20:32:17.9085404Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9085613Z 2025-05-07T20:32:17.9085715Z moe/activation_test.py:126: 2025-05-07T20:32:17.9086019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9086476Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9086819Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9087617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9088368Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9088926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9089612Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9090308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9091027Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9091783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9092542Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9093268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9093911Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9094564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9095089Z fn() 2025-05-07T20:32:17.9095594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9096180Z self.fn.run( 2025-05-07T20:32:17.9096658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9097183Z kernel = self.compile( 2025-05-07T20:32:17.9097748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9098453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9098857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9099094Z 2025-05-07T20:32:17.9099309Z self = 2025-05-07T20:32:17.9100400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9101868Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ecc73af0>} 2025-05-07T20:32:17.9103209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9104246Z context = 2025-05-07T20:32:17.9104538Z 2025-05-07T20:32:17.9104708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9105241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9105709Z module_map=module_map) 2025-05-07T20:32:17.9106078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9106445Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9106722Z E ^ 2025-05-07T20:32:17.9107187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9107637Z 2025-05-07T20:32:17.9108053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9108665Z 2025-05-07T20:32:17.9108774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9109195Z self=, 2025-05-07T20:32:17.9109602Z T=128, 2025-05-07T20:32:17.9109791Z D=7168, 2025-05-07T20:32:17.9109994Z scale_ub=None, 2025-05-07T20:32:17.9111196Z contiguous=False, 2025-05-07T20:32:17.9111431Z compiled=False, 2025-05-07T20:32:17.9111644Z ) 2025-05-07T20:32:17.9111972Z self = 2025-05-07T20:32:17.9112468Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9112747Z 2025-05-07T20:32:17.9112829Z @given( 2025-05-07T20:32:17.9113069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9113390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9113707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9114053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9114395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9114687Z ) 2025-05-07T20:32:17.9115044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9115493Z def test_silu_mul_quant( 2025-05-07T20:32:17.9115789Z self, 2025-05-07T20:32:17.9115992Z T: int, 2025-05-07T20:32:17.9116201Z D: int, 2025-05-07T20:32:17.9116422Z scale_ub: Optional[float], 2025-05-07T20:32:17.9116703Z contiguous: bool, 2025-05-07T20:32:17.9116952Z compiled: bool, 2025-05-07T20:32:17.9117180Z ) -> None: 2025-05-07T20:32:17.9117411Z torch.manual_seed(2025) 2025-05-07T20:32:17.9117666Z 2025-05-07T20:32:17.9117947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9118299Z 2025-05-07T20:32:17.9118502Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9118802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9119172Z x = x_sign * x_clamp 2025-05-07T20:32:17.9119420Z x0 = x[:, :D] 2025-05-07T20:32:17.9119647Z x1 = x[:, D:] 2025-05-07T20:32:17.9119854Z 2025-05-07T20:32:17.9120045Z if contiguous: 2025-05-07T20:32:17.9120281Z x0 = x0.contiguous() 2025-05-07T20:32:17.9120544Z x1 = x1.contiguous() 2025-05-07T20:32:17.9120792Z 2025-05-07T20:32:17.9120991Z if scale_ub is not None: 2025-05-07T20:32:17.9121266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9121612Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9121931Z ) 2025-05-07T20:32:17.9122124Z else: 2025-05-07T20:32:17.9122342Z scale_ub_tensor = None 2025-05-07T20:32:17.9122602Z 2025-05-07T20:32:17.9122835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9123158Z op = silu_mul_quant 2025-05-07T20:32:17.9123420Z if compiled: 
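Hypothesis printed a replay blob with the first failure above; pinning it on the test re-runs exactly that falsifying example instead of searching again. The blob is only valid under the same Hypothesis version ('6.131.14'), and the strategy below is trimmed for illustration:

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob from the log above
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_replay(T: int) -> None:  # hypothetical trimmed signature
    ...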
2025-05-07T20:32:17.9123678Z op = torch.compile(op) 2025-05-07T20:32:17.9123982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9124265Z 2025-05-07T20:32:17.9124469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9124637Z 2025-05-07T20:32:17.9124740Z moe/activation_test.py:117: 2025-05-07T20:32:17.9125047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9125389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9125674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9126371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9127074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9127620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9128384Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9129065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9129695Z kernel = self.compile( 2025-05-07T20:32:17.9130470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9131253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9131659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9131893Z 2025-05-07T20:32:17.9132111Z self = 2025-05-07T20:32:17.9133205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9134651Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebf6f430>} 2025-05-07T20:32:17.9136000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9137102Z context = 2025-05-07T20:32:17.9137397Z 2025-05-07T20:32:17.9137573Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9138103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9138575Z module_map=module_map) 2025-05-07T20:32:17.9138952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9139306Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9139628Z E ^ 2025-05-07T20:32:17.9140359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9140939Z 2025-05-07T20:32:17.9141549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9142080Z 2025-05-07T20:32:17.9142187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9142607Z self=, 2025-05-07T20:32:17.9143013Z T=4096, 2025-05-07T20:32:17.9143202Z D=5120, 2025-05-07T20:32:17.9143404Z scale_ub=1200.0, 2025-05-07T20:32:17.9143636Z contiguous=True, 2025-05-07T20:32:17.9143865Z compiled=False, 2025-05-07T20:32:17.9144078Z ) 2025-05-07T20:32:17.9144402Z self = 2025-05-07T20:32:17.9144902Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9145197Z 2025-05-07T20:32:17.9145278Z @given( 2025-05-07T20:32:17.9145517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9145837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9146146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9146494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9146831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9147119Z ) 2025-05-07T20:32:17.9147482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9147931Z def test_silu_mul_quant( 2025-05-07T20:32:17.9148179Z self, 2025-05-07T20:32:17.9148385Z T: int, 2025-05-07T20:32:17.9148594Z D: int, 2025-05-07T20:32:17.9148816Z scale_ub: Optional[float], 2025-05-07T20:32:17.9149100Z contiguous: bool, 2025-05-07T20:32:17.9149351Z compiled: bool, 2025-05-07T20:32:17.9149583Z ) -> None: 2025-05-07T20:32:17.9150005Z torch.manual_seed(2025) 2025-05-07T20:32:17.9150262Z 2025-05-07T20:32:17.9150545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9150890Z 2025-05-07T20:32:17.9151094Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9151396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9151714Z x = x_sign * x_clamp 2025-05-07T20:32:17.9151967Z x0 = x[:, :D] 2025-05-07T20:32:17.9152194Z x1 = x[:, D:] 2025-05-07T20:32:17.9152405Z 2025-05-07T20:32:17.9152599Z if contiguous: 2025-05-07T20:32:17.9152841Z x0 = x0.contiguous() 2025-05-07T20:32:17.9153106Z x1 = x1.contiguous() 2025-05-07T20:32:17.9153358Z 2025-05-07T20:32:17.9153562Z if scale_ub is not None: 2025-05-07T20:32:17.9153840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9154192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9154513Z ) 2025-05-07T20:32:17.9154722Z else: 2025-05-07T20:32:17.9154938Z scale_ub_tensor = None 2025-05-07T20:32:17.9155201Z 2025-05-07T20:32:17.9155439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9155761Z op = silu_mul_quant 2025-05-07T20:32:17.9156095Z if compiled: 2025-05-07T20:32:17.9156351Z op = torch.compile(op) 2025-05-07T20:32:17.9156651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9156936Z 2025-05-07T20:32:17.9157136Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9157304Z 2025-05-07T20:32:17.9157406Z moe/activation_test.py:117: 2025-05-07T20:32:17.9157709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9158049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9158342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9159038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9159842Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9160387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9161066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9161739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9162393Z kernel = self.compile( 2025-05-07T20:32:17.9162979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9163653Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9164060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9164295Z 2025-05-07T20:32:17.9164522Z self = 2025-05-07T20:32:17.9165618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9180262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6430>} 2025-05-07T20:32:17.9181773Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9182816Z context = 2025-05-07T20:32:17.9183115Z 2025-05-07T20:32:17.9183300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9184657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9185162Z module_map=module_map) 2025-05-07T20:32:17.9185546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9185906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9186178Z E ^ 2025-05-07T20:32:17.9186659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9187114Z 2025-05-07T20:32:17.9187539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9188064Z 2025-05-07T20:32:17.9188170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9188596Z self=, 2025-05-07T20:32:17.9189005Z T=1, 2025-05-07T20:32:17.9189190Z D=5120, 2025-05-07T20:32:17.9189394Z scale_ub=None, 2025-05-07T20:32:17.9189635Z contiguous=True, 2025-05-07T20:32:17.9189863Z compiled=True, 2025-05-07T20:32:17.9190080Z ) 2025-05-07T20:32:17.9190409Z self = 2025-05-07T20:32:17.9190899Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9191239Z 2025-05-07T20:32:17.9191320Z @given( 2025-05-07T20:32:17.9191561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9191889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9192200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9192544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9192884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9193178Z ) 2025-05-07T20:32:17.9193555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9194117Z def test_silu_mul_quant( 2025-05-07T20:32:17.9194427Z self, 2025-05-07T20:32:17.9194636Z T: int, 2025-05-07T20:32:17.9194842Z D: int, 2025-05-07T20:32:17.9195063Z scale_ub: Optional[float], 2025-05-07T20:32:17.9195172Z contiguous: bool, 2025-05-07T20:32:17.9195261Z compiled: bool, 2025-05-07T20:32:17.9195353Z ) -> None: 2025-05-07T20:32:17.9195456Z torch.manual_seed(2025) 2025-05-07T20:32:17.9195531Z 2025-05-07T20:32:17.9195711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9195787Z 2025-05-07T20:32:17.9195883Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9196018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9196111Z x = x_sign * x_clamp 2025-05-07T20:32:17.9196195Z x0 = x[:, :D] 2025-05-07T20:32:17.9196284Z x1 = x[:, D:] 2025-05-07T20:32:17.9196361Z 2025-05-07T20:32:17.9196446Z if contiguous: 2025-05-07T20:32:17.9196551Z x0 = x0.contiguous() 2025-05-07T20:32:17.9196655Z x1 = x1.contiguous() 2025-05-07T20:32:17.9196730Z 2025-05-07T20:32:17.9196833Z if scale_ub is not None: 2025-05-07T20:32:17.9196942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9197089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9197171Z ) 2025-05-07T20:32:17.9197252Z else: 2025-05-07T20:32:17.9197355Z scale_ub_tensor = None 2025-05-07T20:32:17.9197429Z 2025-05-07T20:32:17.9197563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9197663Z op = silu_mul_quant 2025-05-07T20:32:17.9197751Z if compiled: 2025-05-07T20:32:17.9197855Z op = torch.compile(op) 2025-05-07T20:32:17.9197972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9198047Z 2025-05-07T20:32:17.9198142Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9198273Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9198443Z 2025-05-07T20:32:17.9198594Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9198700Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9198803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9198937Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9199083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9199162Z 2025-05-07T20:32:17.9199273Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9199277Z 2025-05-07T20:32:17.9199380Z moe/activation_test.py:126: 2025-05-07T20:32:17.9199519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9199627Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9199766Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9200351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9200459Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9200827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9201068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9201487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9201758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9202162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9202421Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9202815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9203032Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9203385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9203468Z fn() 2025-05-07T20:32:17.9203867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9203967Z self.fn.run( 2025-05-07T20:32:17.9204324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9204461Z kernel = self.compile( 2025-05-07T20:32:17.9205002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9205199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9205339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9205356Z 2025-05-07T20:32:17.9205567Z self = 2025-05-07T20:32:17.9206352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9206874Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ebcb6940>} 2025-05-07T20:32:17.9207617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9207819Z context = 2025-05-07T20:32:17.9208095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9208372Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9208491Z module_map=module_map) 2025-05-07T20:32:17.9208655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9208770Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9208851Z E ^ 2025-05-07T20:32:17.9209209Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9209635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then retried the test with the parameter combinations below. Each attempt fails with the identical fp8e4nv CompilationError; only the failing call site and the kernel being compiled differ, and the per-example source listing and traceback match the one shown above:

2025-05-07T20:32:17.9209744Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9226952Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9244739Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9261973Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9278865Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant (via torch.compile and fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
2025-05-07T20:32:17.9292258Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9309329Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant (eager path, fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
2025-05-07T20:32:17.9328445Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:17.9342685Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:17.9355965Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() at moe/activation_test.py:117 fails in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant); trace truncated

Every failure above shares the same root cause:
E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9363184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9363416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9363767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9363863Z kernel = self.compile( 2025-05-07T20:32:17.9364251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9364434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9364570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9364574Z 2025-05-07T20:32:17.9364783Z self = 2025-05-07T20:32:17.9365557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9366146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ead645e0>} 2025-05-07T20:32:17.9366890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9367094Z context = 2025-05-07T20:32:17.9367098Z 2025-05-07T20:32:17.9367266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9367536Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9367653Z module_map=module_map) 2025-05-07T20:32:17.9367815Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9367921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9367998Z E ^ 2025-05-07T20:32:17.9368359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9368363Z 2025-05-07T20:32:17.9368788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9368833Z 2025-05-07T20:32:17.9368940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9369173Z self=, 2025-05-07T20:32:17.9369253Z T=128, 2025-05-07T20:32:17.9369332Z D=5120, 2025-05-07T20:32:17.9369423Z scale_ub=1200.0, 2025-05-07T20:32:17.9369510Z contiguous=True, 2025-05-07T20:32:17.9369595Z compiled=False, 2025-05-07T20:32:17.9369677Z ) 2025-05-07T20:32:17.9369898Z self = 2025-05-07T20:32:17.9370070Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9370190Z 2025-05-07T20:32:17.9370283Z @given( 2025-05-07T20:32:17.9370403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9370516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9370635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9370754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9370878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9370955Z ) 2025-05-07T20:32:17.9371203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9371304Z def test_silu_mul_quant( 2025-05-07T20:32:17.9371384Z self, 2025-05-07T20:32:17.9371462Z T: int, 2025-05-07T20:32:17.9371546Z D: int, 2025-05-07T20:32:17.9371645Z scale_ub: Optional[float], 2025-05-07T20:32:17.9371737Z contiguous: bool, 2025-05-07T20:32:17.9371830Z compiled: bool, 2025-05-07T20:32:17.9371911Z ) -> None: 2025-05-07T20:32:17.9372023Z torch.manual_seed(2025) 2025-05-07T20:32:17.9372103Z 2025-05-07T20:32:17.9372278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9372361Z 2025-05-07T20:32:17.9372456Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9372582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9372684Z x = x_sign * x_clamp 2025-05-07T20:32:17.9372767Z x0 = x[:, :D] 2025-05-07T20:32:17.9372850Z x1 = x[:, D:] 2025-05-07T20:32:17.9372932Z 2025-05-07T20:32:17.9373019Z if contiguous: 2025-05-07T20:32:17.9373115Z x0 = x0.contiguous() 2025-05-07T20:32:17.9373213Z x1 = x1.contiguous() 2025-05-07T20:32:17.9373286Z 2025-05-07T20:32:17.9373380Z if scale_ub is not None: 2025-05-07T20:32:17.9373497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9373634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9373721Z ) 2025-05-07T20:32:17.9373888Z else: 2025-05-07T20:32:17.9373989Z scale_ub_tensor = None 2025-05-07T20:32:17.9374071Z 2025-05-07T20:32:17.9374205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9374296Z op = silu_mul_quant 2025-05-07T20:32:17.9374390Z if compiled: 2025-05-07T20:32:17.9374494Z op = torch.compile(op) 2025-05-07T20:32:17.9374603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9374685Z 2025-05-07T20:32:17.9374780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9374784Z 2025-05-07T20:32:17.9374893Z moe/activation_test.py:117: 2025-05-07T20:32:17.9375021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9375124Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9375231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9375741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9375845Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9376213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9376436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9376855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9376951Z kernel = self.compile( 2025-05-07T20:32:17.9377331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9377513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9377639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9377644Z 2025-05-07T20:32:17.9377851Z self = 2025-05-07T20:32:17.9378688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9379201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb2c3b80>} 2025-05-07T20:32:17.9379947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9380140Z context = 2025-05-07T20:32:17.9380145Z 2025-05-07T20:32:17.9380317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9380587Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9380698Z module_map=module_map) 2025-05-07T20:32:17.9380884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9380984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9381203Z E ^ 2025-05-07T20:32:17.9381573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9381577Z 2025-05-07T20:32:17.9381991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9381996Z 2025-05-07T20:32:17.9382109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9382334Z self=, 2025-05-07T20:32:17.9382412Z T=1, 2025-05-07T20:32:17.9382499Z D=7168, 2025-05-07T20:32:17.9382583Z scale_ub=1200.0, 2025-05-07T20:32:17.9382672Z contiguous=True, 2025-05-07T20:32:17.9382851Z compiled=True, 2025-05-07T20:32:17.9382930Z ) 2025-05-07T20:32:17.9383150Z self = 2025-05-07T20:32:17.9383324Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9383332Z 2025-05-07T20:32:17.9383412Z @given( 2025-05-07T20:32:17.9383532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9383642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9383757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9383887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9384001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9384077Z ) 2025-05-07T20:32:17.9384329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9384425Z def test_silu_mul_quant( 2025-05-07T20:32:17.9384504Z self, 2025-05-07T20:32:17.9384598Z T: int, 2025-05-07T20:32:17.9384679Z D: int, 2025-05-07T20:32:17.9384779Z scale_ub: Optional[float], 2025-05-07T20:32:17.9384876Z contiguous: bool, 2025-05-07T20:32:17.9384965Z compiled: bool, 2025-05-07T20:32:17.9385045Z ) -> None: 2025-05-07T20:32:17.9385150Z torch.manual_seed(2025) 2025-05-07T20:32:17.9385265Z 2025-05-07T20:32:17.9385444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9385520Z 2025-05-07T20:32:17.9385614Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9385747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9385838Z x = x_sign * x_clamp 2025-05-07T20:32:17.9385922Z x0 = x[:, :D] 2025-05-07T20:32:17.9386009Z x1 = x[:, D:] 2025-05-07T20:32:17.9386082Z 2025-05-07T20:32:17.9386167Z if contiguous: 2025-05-07T20:32:17.9386266Z x0 = x0.contiguous() 2025-05-07T20:32:17.9386357Z x1 = x1.contiguous() 2025-05-07T20:32:17.9386481Z 2025-05-07T20:32:17.9386584Z if scale_ub is not None: 2025-05-07T20:32:17.9386696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9386849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9386927Z ) 2025-05-07T20:32:17.9387009Z else: 2025-05-07T20:32:17.9387113Z scale_ub_tensor = None 2025-05-07T20:32:17.9387189Z 2025-05-07T20:32:17.9387330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9387431Z op = silu_mul_quant 2025-05-07T20:32:17.9387519Z if compiled: 2025-05-07T20:32:17.9387624Z op = torch.compile(op) 2025-05-07T20:32:17.9387741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9387816Z 2025-05-07T20:32:17.9387912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9387923Z 2025-05-07T20:32:17.9388026Z moe/activation_test.py:117: 2025-05-07T20:32:17.9388173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9388292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9388398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9388839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9388946Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9389550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9389654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9390092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9390351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9390764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9390944Z kernel = self.compile( 2025-05-07T20:32:17.9391332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9391515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9391645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9391650Z 2025-05-07T20:32:17.9391866Z self = 2025-05-07T20:32:17.9392640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9393147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea58e820>} 2025-05-07T20:32:17.9393902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9394097Z context = 2025-05-07T20:32:17.9394179Z 2025-05-07T20:32:17.9394357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9394620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9394731Z module_map=module_map) 2025-05-07T20:32:17.9394901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9395001Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9395087Z E ^ 2025-05-07T20:32:17.9395441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9395484Z 2025-05-07T20:32:17.9395911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9395916Z 2025-05-07T20:32:17.9396030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9396254Z self=, 2025-05-07T20:32:17.9396344Z T=1, 2025-05-07T20:32:17.9396423Z D=7168, 2025-05-07T20:32:17.9396507Z scale_ub=1200.0, 2025-05-07T20:32:17.9396604Z contiguous=False, 2025-05-07T20:32:17.9396689Z compiled=True, 2025-05-07T20:32:17.9396764Z ) 2025-05-07T20:32:17.9396991Z self = 2025-05-07T20:32:17.9397158Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9397163Z 2025-05-07T20:32:17.9397242Z @given( 2025-05-07T20:32:17.9397370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9397480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9397597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9397723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9397839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9397926Z ) 2025-05-07T20:32:17.9398178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9398274Z def test_silu_mul_quant( 2025-05-07T20:32:17.9398360Z self, 2025-05-07T20:32:17.9398442Z T: int, 2025-05-07T20:32:17.9398521Z D: int, 2025-05-07T20:32:17.9398631Z scale_ub: Optional[float], 2025-05-07T20:32:17.9398724Z contiguous: bool, 2025-05-07T20:32:17.9398815Z compiled: bool, 2025-05-07T20:32:17.9398904Z ) -> None: 2025-05-07T20:32:17.9399002Z torch.manual_seed(2025) 2025-05-07T20:32:17.9399077Z 2025-05-07T20:32:17.9399257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9399415Z 2025-05-07T20:32:17.9399517Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9399644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9399736Z x = x_sign * x_clamp 2025-05-07T20:32:17.9399831Z x0 = x[:, :D] 2025-05-07T20:32:17.9399913Z x1 = x[:, D:] 2025-05-07T20:32:17.9399993Z 2025-05-07T20:32:17.9400088Z if contiguous: 2025-05-07T20:32:17.9400179Z x0 = x0.contiguous() 2025-05-07T20:32:17.9400269Z x1 = x1.contiguous() 2025-05-07T20:32:17.9400351Z 2025-05-07T20:32:17.9400445Z if scale_ub is not None: 2025-05-07T20:32:17.9400555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9400702Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9400784Z ) 2025-05-07T20:32:17.9400869Z else: 2025-05-07T20:32:17.9400965Z scale_ub_tensor = None 2025-05-07T20:32:17.9401039Z 2025-05-07T20:32:17.9401193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9401286Z op = silu_mul_quant 2025-05-07T20:32:17.9401373Z if compiled: 2025-05-07T20:32:17.9401481Z op = torch.compile(op) 2025-05-07T20:32:17.9401589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9401709Z 2025-05-07T20:32:17.9401812Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9401816Z 2025-05-07T20:32:17.9401915Z moe/activation_test.py:117: 2025-05-07T20:32:17.9402044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9402153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9402256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9402630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9402723Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9403222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9403372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9403734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9403963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9404306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9404401Z kernel = self.compile( 2025-05-07T20:32:17.9404788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9404969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9405097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9405101Z 2025-05-07T20:32:17.9405321Z self = 2025-05-07T20:32:17.9406095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9406616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea4944c0>} 2025-05-07T20:32:17.9407360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9407561Z context = 2025-05-07T20:32:17.9407566Z 2025-05-07T20:32:17.9407732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9408107Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9408228Z module_map=module_map) 2025-05-07T20:32:17.9408389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9408490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9408576Z E ^ 2025-05-07T20:32:17.9408928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9408933Z 2025-05-07T20:32:17.9409356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9409360Z 2025-05-07T20:32:17.9409465Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9409689Z self=, 2025-05-07T20:32:17.9409778Z T=1, 2025-05-07T20:32:17.9409855Z D=7168, 2025-05-07T20:32:17.9409938Z scale_ub=None, 2025-05-07T20:32:17.9410040Z contiguous=False, 2025-05-07T20:32:17.9410126Z compiled=True, 2025-05-07T20:32:17.9410207Z ) 2025-05-07T20:32:17.9410424Z self = 2025-05-07T20:32:17.9410590Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9410636Z 2025-05-07T20:32:17.9410726Z @given( 2025-05-07T20:32:17.9410847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9410949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9411074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9411193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9411306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9411388Z ) 2025-05-07T20:32:17.9411638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9411743Z def test_silu_mul_quant( 2025-05-07T20:32:17.9411869Z self, 2025-05-07T20:32:17.9411951Z T: int, 2025-05-07T20:32:17.9412038Z D: int, 2025-05-07T20:32:17.9412140Z scale_ub: Optional[float], 2025-05-07T20:32:17.9412231Z contiguous: bool, 2025-05-07T20:32:17.9412326Z compiled: bool, 2025-05-07T20:32:17.9412407Z ) -> None: 2025-05-07T20:32:17.9412506Z torch.manual_seed(2025) 2025-05-07T20:32:17.9412587Z 2025-05-07T20:32:17.9412761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9412838Z 2025-05-07T20:32:17.9412938Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9413066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9413165Z x = x_sign * x_clamp 2025-05-07T20:32:17.9413248Z x0 = x[:, :D] 2025-05-07T20:32:17.9413333Z x1 = x[:, D:] 2025-05-07T20:32:17.9413426Z 2025-05-07T20:32:17.9413524Z if contiguous: 2025-05-07T20:32:17.9413637Z x0 = x0.contiguous() 2025-05-07T20:32:17.9413748Z x1 = x1.contiguous() 2025-05-07T20:32:17.9413820Z 2025-05-07T20:32:17.9413913Z if scale_ub is not None: 2025-05-07T20:32:17.9414028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9414165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9414249Z ) 2025-05-07T20:32:17.9414336Z else: 2025-05-07T20:32:17.9414434Z scale_ub_tensor = None 2025-05-07T20:32:17.9414520Z 2025-05-07T20:32:17.9414652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9414745Z op = silu_mul_quant 2025-05-07T20:32:17.9414838Z if compiled: 2025-05-07T20:32:17.9414939Z op = torch.compile(op) 2025-05-07T20:32:17.9415046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9415129Z 2025-05-07T20:32:17.9415222Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9415346Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9415508Z 2025-05-07T20:32:17.9415649Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9415753Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9415862Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9415985Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9416137Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9416218Z 2025-05-07T20:32:17.9416321Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9416325Z 2025-05-07T20:32:17.9416433Z moe/activation_test.py:126: 2025-05-07T20:32:17.9416561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9416668Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9416809Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9417379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9417488Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9417848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9418072Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9418495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9418751Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9419147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9419407Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9419790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9420017Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9420358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9420437Z fn() 2025-05-07T20:32:17.9420844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9420928Z self.fn.run( 2025-05-07T20:32:17.9421393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9421490Z kernel = self.compile( 2025-05-07T20:32:17.9421867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9422051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9422184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9422191Z 2025-05-07T20:32:17.9422403Z self = 2025-05-07T20:32:17.9423183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9423747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ea51e040>} 2025-05-07T20:32:17.9424493Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9424686Z context = 2025-05-07T20:32:17.9424691Z 2025-05-07T20:32:17.9424945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9425218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9425327Z module_map=module_map) 2025-05-07T20:32:17.9425495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9425605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9425684Z E ^ 2025-05-07T20:32:17.9426051Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9426056Z 2025-05-07T20:32:17.9426468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9426472Z 2025-05-07T20:32:17.9426583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9426807Z self=, 2025-05-07T20:32:17.9426898Z T=1, 2025-05-07T20:32:17.9426992Z D=5120, 2025-05-07T20:32:17.9427078Z scale_ub=1200.0, 2025-05-07T20:32:17.9427168Z contiguous=False, 2025-05-07T20:32:17.9427259Z compiled=True, 2025-05-07T20:32:17.9427334Z ) 2025-05-07T20:32:17.9427562Z self = 2025-05-07T20:32:17.9427770Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9427775Z 2025-05-07T20:32:17.9427854Z @given( 2025-05-07T20:32:17.9427984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9428085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9428201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9428330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9428445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9428520Z ) 2025-05-07T20:32:17.9428781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9428928Z def test_silu_mul_quant( 2025-05-07T20:32:17.9429015Z self, 2025-05-07T20:32:17.9429094Z T: int, 2025-05-07T20:32:17.9429173Z D: int, 2025-05-07T20:32:17.9429282Z scale_ub: Optional[float], 2025-05-07T20:32:17.9429375Z contiguous: bool, 2025-05-07T20:32:17.9429466Z compiled: bool, 2025-05-07T20:32:17.9429555Z ) -> None: 2025-05-07T20:32:17.9429650Z torch.manual_seed(2025) 2025-05-07T20:32:17.9429730Z 2025-05-07T20:32:17.9429915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9429993Z 2025-05-07T20:32:17.9430088Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9430223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9430314Z x = x_sign * x_clamp 2025-05-07T20:32:17.9430404Z x0 = x[:, :D] 2025-05-07T20:32:17.9430487Z x1 = x[:, D:] 2025-05-07T20:32:17.9430561Z 2025-05-07T20:32:17.9430665Z if contiguous: 2025-05-07T20:32:17.9430764Z x0 = x0.contiguous() 2025-05-07T20:32:17.9430855Z x1 = x1.contiguous() 2025-05-07T20:32:17.9430936Z 2025-05-07T20:32:17.9431028Z if scale_ub is not None: 2025-05-07T20:32:17.9431137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9431286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9431370Z ) 2025-05-07T20:32:17.9431451Z else: 2025-05-07T20:32:17.9431555Z scale_ub_tensor = None 2025-05-07T20:32:17.9431629Z 2025-05-07T20:32:17.9431764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9431867Z op = silu_mul_quant 2025-05-07T20:32:17.9431956Z if compiled: 
2025-05-07T20:32:17.9432064Z op = torch.compile(op) 2025-05-07T20:32:17.9432175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9432249Z 2025-05-07T20:32:17.9432433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9432439Z 2025-05-07T20:32:17.9432539Z moe/activation_test.py:117: 2025-05-07T20:32:17.9432669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9432775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9432878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9433257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9433351Z return fn(*args, **kwargs) 2025-05-07T20:32:17.9433846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9433949Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9434304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9434529Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9434883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9434979Z kernel = self.compile( 2025-05-07T20:32:17.9435365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9435585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9435713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9435718Z 2025-05-07T20:32:17.9435936Z self = 2025-05-07T20:32:17.9436707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9437220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea51ef70>} 2025-05-07T20:32:17.9438016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9438210Z context = 2025-05-07T20:32:17.9438221Z 2025-05-07T20:32:17.9438387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9438650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9438767Z module_map=module_map) 2025-05-07T20:32:17.9438930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9439033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9439117Z E ^ 2025-05-07T20:32:17.9439486Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9439491Z 2025-05-07T20:32:17.9439910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9439916Z 2025-05-07T20:32:17.9440020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9440662Z self=, 2025-05-07T20:32:17.9440801Z T=1, 2025-05-07T20:32:17.9440920Z D=5120, 2025-05-07T20:32:17.9441055Z scale_ub=1200.0, 2025-05-07T20:32:17.9441191Z contiguous=False, 2025-05-07T20:32:17.9441310Z compiled=False, 2025-05-07T20:32:17.9441412Z ) 2025-05-07T20:32:17.9441657Z self = 2025-05-07T20:32:17.9441830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9441836Z 2025-05-07T20:32:17.9442206Z @given( 2025-05-07T20:32:17.9442330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9442431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9442553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9442671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9442789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9442871Z ) 2025-05-07T20:32:17.9443119Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9443214Z def test_silu_mul_quant( 2025-05-07T20:32:17.9443301Z self, 2025-05-07T20:32:17.9443380Z T: int, 2025-05-07T20:32:17.9443464Z D: int, 2025-05-07T20:32:17.9443578Z scale_ub: Optional[float], 2025-05-07T20:32:17.9455854Z contiguous: bool, 2025-05-07T20:32:17.9456029Z compiled: bool, 2025-05-07T20:32:17.9456149Z ) -> None: 2025-05-07T20:32:17.9456338Z torch.manual_seed(2025) 2025-05-07T20:32:17.9456448Z 2025-05-07T20:32:17.9456691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9456796Z 2025-05-07T20:32:17.9456921Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9457095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9457425Z x = x_sign * x_clamp 2025-05-07T20:32:17.9457543Z x0 = x[:, :D] 2025-05-07T20:32:17.9457640Z x1 = x[:, D:] 2025-05-07T20:32:17.9457719Z 2025-05-07T20:32:17.9457810Z if contiguous: 2025-05-07T20:32:17.9457919Z x0 = x0.contiguous() 2025-05-07T20:32:17.9458014Z x1 = x1.contiguous() 2025-05-07T20:32:17.9458090Z 2025-05-07T20:32:17.9458195Z if scale_ub is not None: 2025-05-07T20:32:17.9458311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9458457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9458549Z ) 2025-05-07T20:32:17.9458724Z else: 2025-05-07T20:32:17.9458834Z scale_ub_tensor = None 2025-05-07T20:32:17.9458914Z 2025-05-07T20:32:17.9459050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9459156Z op = silu_mul_quant 2025-05-07T20:32:17.9459247Z if compiled: 2025-05-07T20:32:17.9459357Z op = torch.compile(op) 2025-05-07T20:32:17.9459477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9459554Z 2025-05-07T20:32:17.9459652Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9459657Z 2025-05-07T20:32:17.9459773Z moe/activation_test.py:117: 2025-05-07T20:32:17.9459917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9460039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9460148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9460664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9460778Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9461290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9461539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9461962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9462088Z kernel = self.compile( 2025-05-07T20:32:17.9462490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9462678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9462812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9462817Z 2025-05-07T20:32:17.9463038Z self = 2025-05-07T20:32:17.9463917Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9464451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9e373a0>} 2025-05-07T20:32:17.9465213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9465412Z context = 2025-05-07T20:32:17.9465425Z 2025-05-07T20:32:17.9465603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9465882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9466005Z module_map=module_map) 2025-05-07T20:32:17.9466177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9466281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9466375Z E ^ 2025-05-07T20:32:17.9466776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9466781Z 2025-05-07T20:32:17.9467207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9467211Z 2025-05-07T20:32:17.9467321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9467549Z self=, 2025-05-07T20:32:17.9467641Z T=16384, 2025-05-07T20:32:17.9467724Z D=5120, 2025-05-07T20:32:17.9467812Z scale_ub=1200.0, 2025-05-07T20:32:17.9467963Z contiguous=False, 2025-05-07T20:32:17.9468052Z compiled=True, 2025-05-07T20:32:17.9468132Z ) 2025-05-07T20:32:17.9468364Z self = 2025-05-07T20:32:17.9468546Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9468553Z 2025-05-07T20:32:17.9468643Z @given( 2025-05-07T20:32:17.9468770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9468877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9469006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9469127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9469245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9469331Z ) 2025-05-07T20:32:17.9469584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9469689Z def test_silu_mul_quant( 2025-05-07T20:32:17.9469772Z self, 2025-05-07T20:32:17.9469862Z T: int, 2025-05-07T20:32:17.9469953Z D: int, 2025-05-07T20:32:17.9470056Z scale_ub: Optional[float], 2025-05-07T20:32:17.9470150Z contiguous: bool, 2025-05-07T20:32:17.9470248Z compiled: bool, 2025-05-07T20:32:17.9470330Z ) -> None: 2025-05-07T20:32:17.9470432Z torch.manual_seed(2025) 2025-05-07T20:32:17.9470515Z 2025-05-07T20:32:17.9470688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9470765Z 2025-05-07T20:32:17.9470869Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9471002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9471096Z x = x_sign * x_clamp 2025-05-07T20:32:17.9471187Z x0 = x[:, :D] 2025-05-07T20:32:17.9471271Z x1 = x[:, D:] 2025-05-07T20:32:17.9471354Z 2025-05-07T20:32:17.9471443Z if contiguous: 2025-05-07T20:32:17.9471539Z x0 = x0.contiguous() 2025-05-07T20:32:17.9471718Z x1 = x1.contiguous() 2025-05-07T20:32:17.9471801Z 2025-05-07T20:32:17.9471899Z if scale_ub is not None: 2025-05-07T20:32:17.9472018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9472161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9472242Z ) 2025-05-07T20:32:17.9493363Z else: 2025-05-07T20:32:17.9493577Z scale_ub_tensor = None 2025-05-07T20:32:17.9493654Z 2025-05-07T20:32:17.9493807Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9493903Z op = silu_mul_quant 2025-05-07T20:32:17.9494005Z if compiled: 2025-05-07T20:32:17.9494107Z op = torch.compile(op) 2025-05-07T20:32:17.9494215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9494292Z 2025-05-07T20:32:17.9494383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9494389Z 2025-05-07T20:32:17.9494494Z moe/activation_test.py:117: 2025-05-07T20:32:17.9494662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9494766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9494871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9495258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9495522Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9496022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9496120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9496476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9496706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9497047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9497206Z kernel = self.compile( 2025-05-07T20:32:17.9497599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9497776Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9497910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9497915Z 2025-05-07T20:32:17.9498124Z self = 2025-05-07T20:32:17.9498898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9499416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45b0d0>} 2025-05-07T20:32:17.9500178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9500377Z context = 2025-05-07T20:32:17.9500384Z 2025-05-07T20:32:17.9500554Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9500833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9500945Z module_map=module_map) 2025-05-07T20:32:17.9501185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9501292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9501369Z E ^ 2025-05-07T20:32:17.9501730Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9501841Z 2025-05-07T20:32:17.9502270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9502275Z 2025-05-07T20:32:17.9502379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9502606Z self=, 2025-05-07T20:32:17.9502690Z T=2048, 2025-05-07T20:32:17.9502770Z D=7168, 2025-05-07T20:32:17.9502879Z scale_ub=1200.0, 2025-05-07T20:32:17.9502974Z contiguous=False, 2025-05-07T20:32:17.9503073Z compiled=True, 2025-05-07T20:32:17.9503157Z ) 2025-05-07T20:32:17.9503380Z self = 2025-05-07T20:32:17.9503556Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9503566Z 2025-05-07T20:32:17.9503641Z @given( 2025-05-07T20:32:17.9503764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9503884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9504000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9504118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9504239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9504314Z ) 2025-05-07T20:32:17.9504611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9504712Z def test_silu_mul_quant( 2025-05-07T20:32:17.9504786Z self, 2025-05-07T20:32:17.9504862Z T: int, 2025-05-07T20:32:17.9504941Z D: int, 2025-05-07T20:32:17.9505039Z scale_ub: Optional[float], 2025-05-07T20:32:17.9505133Z contiguous: bool, 2025-05-07T20:32:17.9505218Z compiled: bool, 2025-05-07T20:32:17.9505297Z ) -> None: 2025-05-07T20:32:17.9505397Z torch.manual_seed(2025) 2025-05-07T20:32:17.9505472Z 2025-05-07T20:32:17.9505652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9505825Z 2025-05-07T20:32:17.9505918Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9506045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9506137Z x = x_sign * x_clamp 2025-05-07T20:32:17.9506219Z x0 = x[:, :D] 2025-05-07T20:32:17.9506301Z x1 = x[:, D:] 2025-05-07T20:32:17.9506384Z 2025-05-07T20:32:17.9506468Z if contiguous: 2025-05-07T20:32:17.9506566Z x0 = x0.contiguous() 2025-05-07T20:32:17.9506659Z x1 = x1.contiguous() 2025-05-07T20:32:17.9506735Z 2025-05-07T20:32:17.9506839Z if scale_ub is not None: 2025-05-07T20:32:17.9506948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9507087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9507178Z ) 2025-05-07T20:32:17.9507257Z else: 2025-05-07T20:32:17.9507354Z scale_ub_tensor = None 2025-05-07T20:32:17.9507437Z 2025-05-07T20:32:17.9507578Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9507670Z op = silu_mul_quant 2025-05-07T20:32:17.9507765Z if compiled: 2025-05-07T20:32:17.9507866Z op = torch.compile(op) 2025-05-07T20:32:17.9507976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9508052Z 2025-05-07T20:32:17.9508145Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9508150Z 2025-05-07T20:32:17.9508255Z moe/activation_test.py:117: 2025-05-07T20:32:17.9508387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9508489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9508599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9508974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9509066Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9509658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9509760Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9510132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9510363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9510700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9510803Z kernel = self.compile( 2025-05-07T20:32:17.9511183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9511366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9511494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9511499Z 2025-05-07T20:32:17.9511713Z self = 2025-05-07T20:32:17.9512499Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9513053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45bca0>} 2025-05-07T20:32:17.9513806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9513999Z context = 2025-05-07T20:32:17.9514004Z 2025-05-07T20:32:17.9514181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9514501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9514611Z module_map=module_map) 2025-05-07T20:32:17.9514784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9514886Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9514970Z E ^ 2025-05-07T20:32:17.9515340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9515345Z 2025-05-07T20:32:17.9515757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9515762Z 2025-05-07T20:32:17.9515875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9516099Z self=, 2025-05-07T20:32:17.9516180Z T=1, 2025-05-07T20:32:17.9516268Z D=5120, 2025-05-07T20:32:17.9516357Z scale_ub=None, 2025-05-07T20:32:17.9516448Z contiguous=False, 2025-05-07T20:32:17.9516546Z compiled=False, 2025-05-07T20:32:17.9516622Z ) 2025-05-07T20:32:17.9516845Z self = 2025-05-07T20:32:17.9517023Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9517030Z 2025-05-07T20:32:17.9517110Z @given( 2025-05-07T20:32:17.9517240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9517342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9517458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9517586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9517703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9517780Z ) 2025-05-07T20:32:17.9518035Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9518130Z def test_silu_mul_quant( 2025-05-07T20:32:17.9518302Z self, 2025-05-07T20:32:17.9518382Z T: int, 2025-05-07T20:32:17.9518461Z D: int, 2025-05-07T20:32:17.9518568Z scale_ub: Optional[float], 2025-05-07T20:32:17.9518659Z contiguous: bool, 2025-05-07T20:32:17.9518749Z compiled: bool, 2025-05-07T20:32:17.9518840Z ) -> None: 2025-05-07T20:32:17.9518934Z torch.manual_seed(2025) 2025-05-07T20:32:17.9519009Z 2025-05-07T20:32:17.9519188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9519264Z 2025-05-07T20:32:17.9519357Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9519491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9519581Z x = x_sign * x_clamp 2025-05-07T20:32:17.9519663Z x0 = x[:, :D] 2025-05-07T20:32:17.9519753Z x1 = x[:, D:] 2025-05-07T20:32:17.9519826Z 2025-05-07T20:32:17.9519917Z if contiguous: 2025-05-07T20:32:17.9520011Z x0 = x0.contiguous() 2025-05-07T20:32:17.9520113Z x1 = x1.contiguous() 2025-05-07T20:32:17.9520193Z 2025-05-07T20:32:17.9520285Z if scale_ub is not None: 2025-05-07T20:32:17.9520393Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9520542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9520665Z ) 2025-05-07T20:32:17.9520741Z else: 2025-05-07T20:32:17.9520849Z scale_ub_tensor = None 2025-05-07T20:32:17.9520924Z 2025-05-07T20:32:17.9521057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9521159Z op = silu_mul_quant 2025-05-07T20:32:17.9521246Z if compiled: 2025-05-07T20:32:17.9521355Z op = torch.compile(op) 2025-05-07T20:32:17.9521464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9521538Z 2025-05-07T20:32:17.9521640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9521645Z 2025-05-07T20:32:17.9521748Z moe/activation_test.py:117: 2025-05-07T20:32:17.9521924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9522035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9522136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9522635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9522745Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9523112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.9523347Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.9523690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.9523788Z     kernel = self.compile(
2025-05-07T20:32:17.9524191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9524373Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.9524515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.9527246Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.9527512Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.9527627Z                            module_map=module_map)
2025-05-07T20:32:17.9527789Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9527897Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9527975Z E       ^
2025-05-07T20:32:17.9528335Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9528761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried eleven more parameter combinations, and every one failed at the same kernel-launch site with the identical CompilationError; the test body and traceback repeat verbatim apart from the parameters (compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9688185Z 2025-05-07T20:32:17.9688610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9688614Z 2025-05-07T20:32:17.9688716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9689026Z self=, 2025-05-07T20:32:17.9689111Z T=16384, 2025-05-07T20:32:17.9689193Z D=5120, 2025-05-07T20:32:17.9689283Z scale_ub=1200.0, 2025-05-07T20:32:17.9689374Z contiguous=False, 2025-05-07T20:32:17.9689460Z compiled=False, 2025-05-07T20:32:17.9689540Z ) 2025-05-07T20:32:17.9689760Z self = 2025-05-07T20:32:17.9689940Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9689950Z 2025-05-07T20:32:17.9690027Z @given( 2025-05-07T20:32:17.9690147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9690254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9690369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9690488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9690609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9690685Z ) 2025-05-07T20:32:17.9690941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9691043Z def test_silu_mul_quant( 2025-05-07T20:32:17.9691120Z self, 2025-05-07T20:32:17.9691206Z T: int, 2025-05-07T20:32:17.9691284Z D: int, 2025-05-07T20:32:17.9691427Z scale_ub: Optional[float], 2025-05-07T20:32:17.9691524Z contiguous: bool, 2025-05-07T20:32:17.9691610Z compiled: bool, 2025-05-07T20:32:17.9691688Z ) -> None: 2025-05-07T20:32:17.9691791Z torch.manual_seed(2025) 2025-05-07T20:32:17.9691864Z 2025-05-07T20:32:17.9692033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9692116Z 2025-05-07T20:32:17.9692208Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9692332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9692434Z x = x_sign * x_clamp 2025-05-07T20:32:17.9692515Z x0 = x[:, :D] 2025-05-07T20:32:17.9692647Z x1 = x[:, D:] 2025-05-07T20:32:17.9692727Z 2025-05-07T20:32:17.9692811Z if contiguous: 2025-05-07T20:32:17.9692915Z x0 = x0.contiguous() 2025-05-07T20:32:17.9693027Z x1 = x1.contiguous() 2025-05-07T20:32:17.9693103Z 2025-05-07T20:32:17.9693221Z if scale_ub is not None: 2025-05-07T20:32:17.9693330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9693466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9693549Z ) 2025-05-07T20:32:17.9693628Z else: 2025-05-07T20:32:17.9693723Z scale_ub_tensor = None 2025-05-07T20:32:17.9693807Z 2025-05-07T20:32:17.9693940Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9694034Z op = silu_mul_quant 2025-05-07T20:32:17.9694126Z if compiled: 2025-05-07T20:32:17.9694225Z op = torch.compile(op) 2025-05-07T20:32:17.9694336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9694417Z 2025-05-07T20:32:17.9694509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9694513Z 2025-05-07T20:32:17.9694617Z moe/activation_test.py:117: 2025-05-07T20:32:17.9694745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9694850Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9694957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9695452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:17.9695548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9695912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9696137Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9696560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9696659Z kernel = self.compile( 2025-05-07T20:32:17.9697039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9697220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9697348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9697352Z 2025-05-07T20:32:17.9697565Z self = 2025-05-07T20:32:17.9698333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9698835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b488b0>} 2025-05-07T20:32:17.9699602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9699796Z context = 2025-05-07T20:32:17.9699840Z 2025-05-07T20:32:17.9700015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9700278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9700387Z module_map=module_map) 2025-05-07T20:32:17.9700556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9700654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9700741Z E ^ 2025-05-07T20:32:17.9701147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9701204Z 2025-05-07T20:32:17.9701626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9701631Z 2025-05-07T20:32:17.9701739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9701964Z self=, 2025-05-07T20:32:17.9702050Z T=16384, 2025-05-07T20:32:17.9702126Z D=5120, 2025-05-07T20:32:17.9702209Z scale_ub=1200.0, 2025-05-07T20:32:17.9702300Z contiguous=True, 2025-05-07T20:32:17.9702383Z compiled=True, 2025-05-07T20:32:17.9702456Z ) 2025-05-07T20:32:17.9702678Z self = 2025-05-07T20:32:17.9702856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9702861Z 2025-05-07T20:32:17.9702954Z @given( 2025-05-07T20:32:17.9703093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9703219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9703343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9703460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9703573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9703658Z ) 2025-05-07T20:32:17.9703904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9703998Z def test_silu_mul_quant( 2025-05-07T20:32:17.9704082Z self, 2025-05-07T20:32:17.9704158Z T: int, 2025-05-07T20:32:17.9704234Z D: int, 2025-05-07T20:32:17.9704338Z scale_ub: Optional[float], 2025-05-07T20:32:17.9704426Z contiguous: bool, 2025-05-07T20:32:17.9704516Z compiled: bool, 2025-05-07T20:32:17.9704601Z ) -> None: 2025-05-07T20:32:17.9704695Z torch.manual_seed(2025) 2025-05-07T20:32:17.9704768Z 2025-05-07T20:32:17.9705046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9705125Z 2025-05-07T20:32:17.9705224Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9705348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9705437Z x = x_sign * x_clamp 2025-05-07T20:32:17.9705528Z x0 = x[:, :D] 2025-05-07T20:32:17.9705610Z x1 = x[:, D:] 2025-05-07T20:32:17.9705684Z 2025-05-07T20:32:17.9705777Z if contiguous: 2025-05-07T20:32:17.9705872Z x0 = x0.contiguous() 2025-05-07T20:32:17.9705962Z x1 = x1.contiguous() 2025-05-07T20:32:17.9706042Z 2025-05-07T20:32:17.9706133Z if scale_ub is not None: 2025-05-07T20:32:17.9706243Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9706383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9706461Z ) 2025-05-07T20:32:17.9706552Z else: 2025-05-07T20:32:17.9706648Z scale_ub_tensor = None 2025-05-07T20:32:17.9706725Z 2025-05-07T20:32:17.9706870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9706961Z op = silu_mul_quant 2025-05-07T20:32:17.9707047Z if compiled: 2025-05-07T20:32:17.9707154Z op = torch.compile(op) 2025-05-07T20:32:17.9707261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9707382Z 2025-05-07T20:32:17.9707479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9707483Z 2025-05-07T20:32:17.9707583Z moe/activation_test.py:117: 2025-05-07T20:32:17.9707718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9707822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9707922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9708293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9708387Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9708883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9709029Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9709395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9709628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9709966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9710059Z kernel = self.compile( 2025-05-07T20:32:17.9710445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9710620Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9710746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9710758Z 2025-05-07T20:32:17.9710969Z self = 2025-05-07T20:32:17.9711739Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9712261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a215e0>} 2025-05-07T20:32:17.9713051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9713247Z context = 2025-05-07T20:32:17.9713252Z 2025-05-07T20:32:17.9713416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9713757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9713877Z module_map=module_map) 2025-05-07T20:32:17.9714039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9714138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9714225Z E ^ 2025-05-07T20:32:17.9714575Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9714581Z 2025-05-07T20:32:17.9715003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9715008Z 2025-05-07T20:32:17.9715115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9715336Z self=, 2025-05-07T20:32:17.9715419Z T=16384, 2025-05-07T20:32:17.9715497Z D=5120, 2025-05-07T20:32:17.9715594Z scale_ub=None, 2025-05-07T20:32:17.9715682Z contiguous=False, 2025-05-07T20:32:17.9715765Z compiled=True, 2025-05-07T20:32:17.9715845Z ) 2025-05-07T20:32:17.9716062Z self = 2025-05-07T20:32:17.9716239Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9716285Z 2025-05-07T20:32:17.9716372Z @given( 2025-05-07T20:32:17.9716490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9716590Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9716712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9716832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9716954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9717029Z ) 2025-05-07T20:32:17.9717274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9717425Z def test_silu_mul_quant( 2025-05-07T20:32:17.9717502Z self, 2025-05-07T20:32:17.9717581Z T: int, 2025-05-07T20:32:17.9717666Z D: int, 2025-05-07T20:32:17.9717764Z scale_ub: Optional[float], 2025-05-07T20:32:17.9717854Z contiguous: bool, 2025-05-07T20:32:17.9717947Z compiled: bool, 2025-05-07T20:32:17.9718027Z ) -> None: 2025-05-07T20:32:17.9718122Z torch.manual_seed(2025) 2025-05-07T20:32:17.9718203Z 2025-05-07T20:32:17.9718371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9718445Z 2025-05-07T20:32:17.9718544Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9718671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9718769Z x = x_sign * x_clamp 2025-05-07T20:32:17.9718850Z x0 = x[:, :D] 2025-05-07T20:32:17.9718930Z x1 = x[:, D:] 2025-05-07T20:32:17.9719009Z 2025-05-07T20:32:17.9719095Z if contiguous: 2025-05-07T20:32:17.9719193Z x0 = x0.contiguous() 2025-05-07T20:32:17.9719292Z x1 = x1.contiguous() 2025-05-07T20:32:17.9719364Z 2025-05-07T20:32:17.9719456Z if scale_ub is not None: 2025-05-07T20:32:17.9719570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9719705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9719786Z ) 2025-05-07T20:32:17.9719872Z else: 2025-05-07T20:32:17.9719967Z scale_ub_tensor = None 2025-05-07T20:32:17.9720049Z 2025-05-07T20:32:17.9720180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9720271Z op = silu_mul_quant 2025-05-07T20:32:17.9720367Z if compiled: 2025-05-07T20:32:17.9720469Z op = torch.compile(op) 2025-05-07T20:32:17.9720575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9720654Z 2025-05-07T20:32:17.9720747Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9720751Z 2025-05-07T20:32:17.9720933Z moe/activation_test.py:117: 2025-05-07T20:32:17.9721068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9721170Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9721277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9721641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9721738Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9722239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9722337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9722702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9722958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9723326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9723433Z kernel = self.compile( 2025-05-07T20:32:17.9723811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9723989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9724163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9724167Z 2025-05-07T20:32:17.9724373Z self = 2025-05-07T20:32:17.9725153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9725662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9c405e0>} 2025-05-07T20:32:17.9726463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9726669Z context = 2025-05-07T20:32:17.9726673Z 2025-05-07T20:32:17.9726839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9727114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9727225Z module_map=module_map) 2025-05-07T20:32:17.9727386Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9727490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9727569Z E ^ 2025-05-07T20:32:17.9727932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9727944Z 2025-05-07T20:32:17.9728364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9728368Z 2025-05-07T20:32:17.9728471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9728702Z self=, 2025-05-07T20:32:17.9728779Z T=2048, 2025-05-07T20:32:17.9728856Z D=5120, 2025-05-07T20:32:17.9728944Z scale_ub=None, 2025-05-07T20:32:17.9729032Z contiguous=False, 2025-05-07T20:32:17.9729115Z compiled=True, 2025-05-07T20:32:17.9729194Z ) 2025-05-07T20:32:17.9729410Z self = 2025-05-07T20:32:17.9729592Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9729597Z 2025-05-07T20:32:17.9729674Z @given( 2025-05-07T20:32:17.9729874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9729984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9730102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9730217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9730337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9730414Z ) 2025-05-07T20:32:17.9730665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9730761Z def test_silu_mul_quant( 2025-05-07T20:32:17.9730837Z self, 2025-05-07T20:32:17.9730921Z T: int, 2025-05-07T20:32:17.9730998Z D: int, 2025-05-07T20:32:17.9731096Z scale_ub: Optional[float], 2025-05-07T20:32:17.9731192Z contiguous: bool, 2025-05-07T20:32:17.9731279Z compiled: bool, 2025-05-07T20:32:17.9731358Z ) -> None: 2025-05-07T20:32:17.9731458Z torch.manual_seed(2025) 2025-05-07T20:32:17.9731531Z 2025-05-07T20:32:17.9731711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9731793Z 2025-05-07T20:32:17.9731885Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9732010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9732106Z x = x_sign * x_clamp 2025-05-07T20:32:17.9732234Z x0 = x[:, :D] 2025-05-07T20:32:17.9732321Z x1 = x[:, D:] 2025-05-07T20:32:17.9732394Z 2025-05-07T20:32:17.9732478Z if contiguous: 2025-05-07T20:32:17.9732577Z x0 = x0.contiguous() 2025-05-07T20:32:17.9732666Z x1 = x1.contiguous() 2025-05-07T20:32:17.9732740Z 2025-05-07T20:32:17.9732838Z if scale_ub is not None: 2025-05-07T20:32:17.9732954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9733110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9733213Z ) 2025-05-07T20:32:17.9733291Z else: 2025-05-07T20:32:17.9733390Z scale_ub_tensor = None 2025-05-07T20:32:17.9733512Z 2025-05-07T20:32:17.9738903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9739020Z op = silu_mul_quant 2025-05-07T20:32:17.9739114Z if compiled: 2025-05-07T20:32:17.9739237Z op = torch.compile(op) 2025-05-07T20:32:17.9739355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9739434Z 2025-05-07T20:32:17.9739537Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9739543Z 2025-05-07T20:32:17.9739645Z moe/activation_test.py:117: 2025-05-07T20:32:17.9739791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9739899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9740007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9740753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9740854Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9741436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9741551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9741919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9742155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9742496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9742591Z kernel = self.compile( 2025-05-07T20:32:17.9742985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9743169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9743302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9743523Z 2025-05-07T20:32:17.9743740Z self = 2025-05-07T20:32:17.9744518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9745043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a21c10>} 2025-05-07T20:32:17.9745798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9746000Z context = 2025-05-07T20:32:17.9746005Z 2025-05-07T20:32:17.9746178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9746454Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9746573Z module_map=module_map) 2025-05-07T20:32:17.9746738Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9746907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9746987Z E ^ 2025-05-07T20:32:17.9747345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9747350Z 2025-05-07T20:32:17.9747772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9747776Z 2025-05-07T20:32:17.9747885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9748108Z self=, 2025-05-07T20:32:17.9748256Z T=2048, 2025-05-07T20:32:17.9748341Z D=5120, 2025-05-07T20:32:17.9748437Z scale_ub=1200.0, 2025-05-07T20:32:17.9748526Z contiguous=False, 2025-05-07T20:32:17.9748612Z compiled=True, 2025-05-07T20:32:17.9748693Z ) 2025-05-07T20:32:17.9748915Z self = 2025-05-07T20:32:17.9749096Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9749101Z 2025-05-07T20:32:17.9749187Z @given( 2025-05-07T20:32:17.9749308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9749408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9749533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9749651Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9749775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9749851Z ) 2025-05-07T20:32:17.9750106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9750212Z def test_silu_mul_quant( 2025-05-07T20:32:17.9750290Z self, 2025-05-07T20:32:17.9750368Z T: int, 2025-05-07T20:32:17.9750455Z D: int, 2025-05-07T20:32:17.9750556Z scale_ub: Optional[float], 2025-05-07T20:32:17.9750647Z contiguous: bool, 2025-05-07T20:32:17.9750745Z compiled: bool, 2025-05-07T20:32:17.9750831Z ) -> None: 2025-05-07T20:32:17.9750932Z torch.manual_seed(2025) 2025-05-07T20:32:17.9751013Z 2025-05-07T20:32:17.9751186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9751269Z 2025-05-07T20:32:17.9751367Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9751496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9751595Z x = x_sign * x_clamp 2025-05-07T20:32:17.9751681Z x0 = x[:, :D] 2025-05-07T20:32:17.9751763Z x1 = x[:, D:] 2025-05-07T20:32:17.9751845Z 2025-05-07T20:32:17.9752021Z if contiguous: 2025-05-07T20:32:17.9752117Z x0 = x0.contiguous() 2025-05-07T20:32:17.9752218Z x1 = x1.contiguous() 2025-05-07T20:32:17.9752292Z 2025-05-07T20:32:17.9752385Z if scale_ub is not None: 2025-05-07T20:32:17.9752501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9752643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9752729Z ) 2025-05-07T20:32:17.9752810Z else: 2025-05-07T20:32:17.9752907Z scale_ub_tensor = None 2025-05-07T20:32:17.9752991Z 2025-05-07T20:32:17.9753125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9753216Z op = silu_mul_quant 2025-05-07T20:32:17.9753311Z if compiled: 2025-05-07T20:32:17.9753413Z op = torch.compile(op) 2025-05-07T20:32:17.9753520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9753600Z 2025-05-07T20:32:17.9753691Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9753705Z 2025-05-07T20:32:17.9753806Z moe/activation_test.py:117: 2025-05-07T20:32:17.9753946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9754049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9754164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9754580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9754676Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9755180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9755282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9755640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9755876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9756270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9756375Z kernel = self.compile( 2025-05-07T20:32:17.9756758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9756941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9757077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9757081Z 2025-05-07T20:32:17.9757293Z self = 2025-05-07T20:32:17.9758074Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9758580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98e8820>} 2025-05-07T20:32:17.9759340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9759544Z context = 2025-05-07T20:32:17.9759548Z 2025-05-07T20:32:17.9759719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9759992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9760104Z module_map=module_map) 2025-05-07T20:32:17.9760270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9760378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9760457Z E ^ 2025-05-07T20:32:17.9760901Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9760906Z 2025-05-07T20:32:17.9761327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9761334Z 2025-05-07T20:32:17.9761439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9761671Z self=, 2025-05-07T20:32:17.9761751Z T=4096, 2025-05-07T20:32:17.9761829Z D=5120, 2025-05-07T20:32:17.9761926Z scale_ub=1200.0, 2025-05-07T20:32:17.9762013Z contiguous=True, 2025-05-07T20:32:17.9762105Z compiled=True, 2025-05-07T20:32:17.9762180Z ) 2025-05-07T20:32:17.9762400Z self = 2025-05-07T20:32:17.9762584Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9762589Z 2025-05-07T20:32:17.9762677Z @given( 2025-05-07T20:32:17.9762798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9762907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9763024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9763143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9763304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9763380Z ) 2025-05-07T20:32:17.9763637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9763735Z def test_silu_mul_quant( 2025-05-07T20:32:17.9763814Z self, 2025-05-07T20:32:17.9763899Z T: int, 2025-05-07T20:32:17.9763978Z D: int, 2025-05-07T20:32:17.9764078Z scale_ub: Optional[float], 2025-05-07T20:32:17.9764177Z contiguous: bool, 2025-05-07T20:32:17.9764265Z compiled: bool, 2025-05-07T20:32:17.9764346Z ) -> None: 2025-05-07T20:32:17.9764452Z torch.manual_seed(2025) 2025-05-07T20:32:17.9764574Z 2025-05-07T20:32:17.9764747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9764832Z 2025-05-07T20:32:17.9764926Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9765061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9765157Z x = x_sign * x_clamp 2025-05-07T20:32:17.9765240Z x0 = x[:, :D] 2025-05-07T20:32:17.9765331Z x1 = x[:, D:] 2025-05-07T20:32:17.9765410Z 2025-05-07T20:32:17.9765496Z if contiguous: 2025-05-07T20:32:17.9765597Z x0 = x0.contiguous() 2025-05-07T20:32:17.9765689Z x1 = x1.contiguous() 2025-05-07T20:32:17.9765764Z 2025-05-07T20:32:17.9765867Z if scale_ub is not None: 2025-05-07T20:32:17.9765977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9766117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9766205Z ) 2025-05-07T20:32:17.9766288Z else: 2025-05-07T20:32:17.9766396Z scale_ub_tensor = None 2025-05-07T20:32:17.9766473Z 2025-05-07T20:32:17.9766606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9766706Z op = silu_mul_quant 2025-05-07T20:32:17.9766796Z if compiled: 2025-05-07T20:32:17.9766905Z op = torch.compile(op) 2025-05-07T20:32:17.9767019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9767093Z 2025-05-07T20:32:17.9767184Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9767189Z 2025-05-07T20:32:17.9767299Z moe/activation_test.py:117: 2025-05-07T20:32:17.9767429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9767531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9767643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9768010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9768192Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9768699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9768797Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9769167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9769399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9769748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9769844Z kernel = self.compile( 2025-05-07T20:32:17.9770224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9770409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9770545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9770551Z 2025-05-07T20:32:17.9770758Z self = 2025-05-07T20:32:17.9771552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9772105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9ddd430>} 2025-05-07T20:32:17.9772869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9773063Z context = 2025-05-07T20:32:17.9773129Z 2025-05-07T20:32:17.9773318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9773590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9773703Z module_map=module_map) 2025-05-07T20:32:17.9773882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9773984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9774065Z E ^ 2025-05-07T20:32:17.9774424Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9774429Z 2025-05-07T20:32:17.9774838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9774843Z 2025-05-07T20:32:17.9774952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9775181Z self=, 2025-05-07T20:32:17.9775263Z T=128, 2025-05-07T20:32:17.9775349Z D=5120, 2025-05-07T20:32:17.9775436Z scale_ub=1200.0, 2025-05-07T20:32:17.9775524Z contiguous=False, 2025-05-07T20:32:17.9775616Z compiled=True, 2025-05-07T20:32:17.9775689Z ) 2025-05-07T20:32:17.9775912Z self = 2025-05-07T20:32:17.9776089Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9776094Z 2025-05-07T20:32:17.9776172Z @given( 2025-05-07T20:32:17.9776300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9776398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9776514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9776636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9776748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9776824Z ) 2025-05-07T20:32:17.9777154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9777250Z def test_silu_mul_quant( 2025-05-07T20:32:17.9777333Z self, 2025-05-07T20:32:17.9777411Z T: int, 2025-05-07T20:32:17.9777485Z D: int, 2025-05-07T20:32:17.9777590Z scale_ub: Optional[float], 2025-05-07T20:32:17.9777687Z contiguous: bool, 2025-05-07T20:32:17.9777774Z compiled: bool, 2025-05-07T20:32:17.9777851Z ) -> None: 2025-05-07T20:32:17.9777950Z torch.manual_seed(2025) 2025-05-07T20:32:17.9778025Z 2025-05-07T20:32:17.9778195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9778276Z 2025-05-07T20:32:17.9778366Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9778495Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9778589Z x = x_sign * x_clamp 2025-05-07T20:32:17.9778670Z x0 = x[:, :D] 2025-05-07T20:32:17.9778756Z x1 = x[:, D:] 2025-05-07T20:32:17.9778837Z 2025-05-07T20:32:17.9778922Z if contiguous: 2025-05-07T20:32:17.9779020Z x0 = x0.contiguous() 2025-05-07T20:32:17.9779109Z x1 = x1.contiguous() 2025-05-07T20:32:17.9779183Z 2025-05-07T20:32:17.9779280Z if scale_ub is not None: 2025-05-07T20:32:17.9779428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9779566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9779651Z ) 2025-05-07T20:32:17.9779729Z else: 2025-05-07T20:32:17.9779822Z scale_ub_tensor = None 2025-05-07T20:32:17.9779901Z 2025-05-07T20:32:17.9780031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9780125Z op = silu_mul_quant 2025-05-07T20:32:17.9780210Z if compiled: 2025-05-07T20:32:17.9780310Z op = torch.compile(op) 2025-05-07T20:32:17.9780423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9780538Z 2025-05-07T20:32:17.9780637Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9780642Z 2025-05-07T20:32:17.9780744Z moe/activation_test.py:117: 2025-05-07T20:32:17.9780872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9780973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9781164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9781535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9781633Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9782125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9782222Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9782583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9782813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9783164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9783259Z kernel = self.compile( 2025-05-07T20:32:17.9783638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9783823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9783951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9783955Z 2025-05-07T20:32:17.9784160Z self = 2025-05-07T20:32:17.9784942Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9785526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982b040>} 2025-05-07T20:32:17.9786283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9786477Z context = 2025-05-07T20:32:17.9786481Z 2025-05-07T20:32:17.9786650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9786911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9787020Z module_map=module_map) 2025-05-07T20:32:17.9787186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9787284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9787370Z E ^ 2025-05-07T20:32:17.9787732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9787737Z 2025-05-07T20:32:17.9788147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9788189Z 2025-05-07T20:32:17.9788300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9788522Z self=, 2025-05-07T20:32:17.9788600Z T=16384, 2025-05-07T20:32:17.9788681Z D=7168, 2025-05-07T20:32:17.9788763Z scale_ub=1200.0, 2025-05-07T20:32:17.9788847Z contiguous=True, 2025-05-07T20:32:17.9788936Z compiled=True, 2025-05-07T20:32:17.9789009Z ) 2025-05-07T20:32:17.9789226Z self = 2025-05-07T20:32:17.9789406Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9789456Z 2025-05-07T20:32:17.9789534Z @given( 2025-05-07T20:32:17.9789659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9789758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9789873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9789996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9790109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9790183Z ) 2025-05-07T20:32:17.9790435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9790532Z def test_silu_mul_quant( 2025-05-07T20:32:17.9790613Z self, 2025-05-07T20:32:17.9790690Z T: int, 2025-05-07T20:32:17.9790766Z D: int, 2025-05-07T20:32:17.9790869Z scale_ub: Optional[float], 2025-05-07T20:32:17.9790957Z contiguous: bool, 2025-05-07T20:32:17.9791044Z compiled: bool, 2025-05-07T20:32:17.9791125Z ) -> None: 2025-05-07T20:32:17.9791230Z torch.manual_seed(2025) 2025-05-07T20:32:17.9791304Z 2025-05-07T20:32:17.9791486Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9791561Z 2025-05-07T20:32:17.9791653Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9791782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9791875Z x = x_sign * x_clamp 2025-05-07T20:32:17.9791961Z x0 = x[:, :D] 2025-05-07T20:32:17.9792043Z x1 = x[:, D:] 2025-05-07T20:32:17.9792116Z 2025-05-07T20:32:17.9792206Z if contiguous: 2025-05-07T20:32:17.9792298Z x0 = x0.contiguous() 2025-05-07T20:32:17.9792386Z x1 = x1.contiguous() 2025-05-07T20:32:17.9792464Z 2025-05-07T20:32:17.9792556Z if scale_ub is not None: 2025-05-07T20:32:17.9792666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9792810Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9792968Z ) 2025-05-07T20:32:17.9793046Z else: 2025-05-07T20:32:17.9793147Z scale_ub_tensor = None 2025-05-07T20:32:17.9793219Z 2025-05-07T20:32:17.9793350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9793448Z op = silu_mul_quant 2025-05-07T20:32:17.9793538Z if compiled: 2025-05-07T20:32:17.9793645Z op = torch.compile(op) 2025-05-07T20:32:17.9793749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9793820Z 2025-05-07T20:32:17.9793915Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9793920Z 2025-05-07T20:32:17.9794016Z moe/activation_test.py:117: 2025-05-07T20:32:17.9794144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9794251Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9794350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9794728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9794831Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9795324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9795423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9795826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9796051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9796395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9796488Z kernel = self.compile( 2025-05-07T20:32:17.9796871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9797045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9797220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9797225Z 2025-05-07T20:32:17.9797435Z self = 2025-05-07T20:32:17.9798205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9798722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982bb80>} 2025-05-07T20:32:17.9799460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9799655Z context = 2025-05-07T20:32:17.9799662Z 2025-05-07T20:32:17.9799835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9800103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9800223Z module_map=module_map) 2025-05-07T20:32:17.9800384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9800486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9800573Z E ^ 2025-05-07T20:32:17.9800926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9800931Z 2025-05-07T20:32:17.9801347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9801351Z 2025-05-07T20:32:17.9801456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9801752Z self=, 2025-05-07T20:32:17.9801837Z T=16384, 2025-05-07T20:32:17.9801914Z D=5120, 2025-05-07T20:32:17.9801996Z scale_ub=1200.0, 2025-05-07T20:32:17.9802089Z contiguous=True, 2025-05-07T20:32:17.9802173Z compiled=False, 2025-05-07T20:32:17.9802249Z ) 2025-05-07T20:32:17.9802470Z self = 2025-05-07T20:32:17.9802655Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9802659Z 2025-05-07T20:32:17.9802740Z @given( 2025-05-07T20:32:17.9802863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9802964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9803079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9803204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9803321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9803409Z ) 2025-05-07T20:32:17.9803657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9803754Z def test_silu_mul_quant( 2025-05-07T20:32:17.9803836Z self, 2025-05-07T20:32:17.9803913Z T: int, 2025-05-07T20:32:17.9803989Z D: int, 2025-05-07T20:32:17.9804159Z scale_ub: Optional[float], 2025-05-07T20:32:17.9804251Z contiguous: bool, 2025-05-07T20:32:17.9804340Z compiled: bool, 2025-05-07T20:32:17.9804423Z ) -> None: 2025-05-07T20:32:17.9804518Z torch.manual_seed(2025) 2025-05-07T20:32:17.9804591Z 2025-05-07T20:32:17.9804765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9804842Z 2025-05-07T20:32:17.9804937Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9805063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9805151Z x = x_sign * x_clamp 2025-05-07T20:32:17.9805238Z x0 = x[:, :D] 2025-05-07T20:32:17.9805367Z x1 = x[:, D:] 2025-05-07T20:32:17.9805439Z 2025-05-07T20:32:17.9805528Z if contiguous: 2025-05-07T20:32:17.9805621Z x0 = x0.contiguous() 2025-05-07T20:32:17.9805714Z x1 = x1.contiguous() 2025-05-07T20:32:17.9805795Z 2025-05-07T20:32:17.9805890Z if scale_ub is not None: 2025-05-07T20:32:17.9805995Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9806135Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9806211Z ) 2025-05-07T20:32:17.9806288Z else: 2025-05-07T20:32:17.9806387Z scale_ub_tensor = None 2025-05-07T20:32:17.9806460Z 2025-05-07T20:32:17.9806593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9806683Z op = silu_mul_quant 2025-05-07T20:32:17.9806767Z if compiled: 2025-05-07T20:32:17.9806870Z op = torch.compile(op) 2025-05-07T20:32:17.9806979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9807055Z 2025-05-07T20:32:17.9807152Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9807157Z 2025-05-07T20:32:17.9807256Z moe/activation_test.py:117: 2025-05-07T20:32:17.9807382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9807492Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9807591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9808102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9808198Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9808557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9808785Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9809200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9809307Z kernel = self.compile( 2025-05-07T20:32:17.9809687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9809861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9809993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9809998Z 2025-05-07T20:32:17.9810202Z self = 2025-05-07T20:32:17.9810973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9811486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97835e0>} 2025-05-07T20:32:17.9812240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9812435Z context = 2025-05-07T20:32:17.9812479Z 2025-05-07T20:32:17.9812648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9812913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9813020Z module_map=module_map) 2025-05-07T20:32:17.9813181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9813284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9813362Z E ^ 2025-05-07T20:32:17.9813720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9813765Z 2025-05-07T20:32:17.9814190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9814195Z 2025-05-07T20:32:17.9814298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9814525Z self=, 2025-05-07T20:32:17.9814601Z T=1, 2025-05-07T20:32:17.9814676Z D=7168, 2025-05-07T20:32:17.9814762Z scale_ub=1200.0, 2025-05-07T20:32:17.9814848Z contiguous=False, 2025-05-07T20:32:17.9814931Z compiled=False, 2025-05-07T20:32:17.9815011Z ) 2025-05-07T20:32:17.9815227Z self = 2025-05-07T20:32:17.9815395Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9815404Z 2025-05-07T20:32:17.9815479Z @given( 2025-05-07T20:32:17.9815601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9815708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9815822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9815939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9816054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9816130Z ) 2025-05-07T20:32:17.9816374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9816472Z def test_silu_mul_quant( 2025-05-07T20:32:17.9816551Z self, 2025-05-07T20:32:17.9816632Z T: int, 2025-05-07T20:32:17.9816708Z D: int, 2025-05-07T20:32:17.9816808Z scale_ub: Optional[float], 2025-05-07T20:32:17.9816903Z contiguous: bool, 2025-05-07T20:32:17.9816988Z compiled: bool, 2025-05-07T20:32:17.9817066Z ) -> None: 2025-05-07T20:32:17.9817165Z torch.manual_seed(2025) 2025-05-07T20:32:17.9817237Z 2025-05-07T20:32:17.9817493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9817575Z 2025-05-07T20:32:17.9817667Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9817792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9817888Z x = x_sign * x_clamp 2025-05-07T20:32:17.9817969Z x0 = x[:, :D] 2025-05-07T20:32:17.9818052Z x1 = x[:, D:] 2025-05-07T20:32:17.9818127Z 2025-05-07T20:32:17.9818211Z if contiguous: 2025-05-07T20:32:17.9818307Z x0 = x0.contiguous() 2025-05-07T20:32:17.9818397Z x1 = x1.contiguous() 2025-05-07T20:32:17.9818471Z 2025-05-07T20:32:17.9818565Z if scale_ub is not None: 2025-05-07T20:32:17.9818669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9818807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9818888Z ) 2025-05-07T20:32:17.9818965Z else: 2025-05-07T20:32:17.9819058Z scale_ub_tensor = None 2025-05-07T20:32:17.9819146Z 2025-05-07T20:32:17.9819276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9819367Z op = silu_mul_quant 2025-05-07T20:32:17.9819455Z if compiled: 2025-05-07T20:32:17.9819554Z op = torch.compile(op) 2025-05-07T20:32:17.9819663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9819780Z 2025-05-07T20:32:17.9819872Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9819877Z 2025-05-07T20:32:17.9819977Z moe/activation_test.py:117: 2025-05-07T20:32:17.9820105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9820206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9820311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9820807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9820902Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9821373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9821601Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9821945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9822042Z kernel = self.compile( 2025-05-07T20:32:17.9822422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9822601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9822724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9822729Z 2025-05-07T20:32:17.9822941Z self = 2025-05-07T20:32:17.9823712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9824225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97839d0>} 2025-05-07T20:32:17.9824982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9825174Z context = 2025-05-07T20:32:17.9825179Z 2025-05-07T20:32:17.9825350Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9825611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9825804Z module_map=module_map) 2025-05-07T20:32:17.9825976Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9826075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9826158Z E ^ 2025-05-07T20:32:17.9826508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9826515Z 2025-05-07T20:32:17.9826924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9826928Z 2025-05-07T20:32:17.9827036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9827256Z self=, 2025-05-07T20:32:17.9827337Z T=4096, 2025-05-07T20:32:17.9827413Z D=7168, 2025-05-07T20:32:17.9827496Z scale_ub=1200.0, 2025-05-07T20:32:17.9827586Z contiguous=False, 2025-05-07T20:32:17.9827669Z compiled=True, 2025-05-07T20:32:17.9827745Z ) 2025-05-07T20:32:17.9827971Z self = 2025-05-07T20:32:17.9828146Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9828151Z 2025-05-07T20:32:17.9828228Z @given( 2025-05-07T20:32:17.9828353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9828493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9828609Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9828725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9828837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9828914Z ) 2025-05-07T20:32:17.9829159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9829253Z def test_silu_mul_quant( 2025-05-07T20:32:17.9829333Z self, 2025-05-07T20:32:17.9829408Z T: int, 2025-05-07T20:32:17.9829485Z D: int, 2025-05-07T20:32:17.9829634Z scale_ub: Optional[float], 2025-05-07T20:32:17.9829723Z contiguous: bool, 2025-05-07T20:32:17.9829809Z compiled: bool, 2025-05-07T20:32:17.9829891Z ) -> None: 2025-05-07T20:32:17.9829985Z torch.manual_seed(2025) 2025-05-07T20:32:17.9830063Z 2025-05-07T20:32:17.9830233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9830307Z 2025-05-07T20:32:17.9830404Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9830530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9830619Z x = x_sign * x_clamp 2025-05-07T20:32:17.9830704Z x0 = x[:, :D] 2025-05-07T20:32:17.9830784Z x1 = x[:, D:] 2025-05-07T20:32:17.9830856Z 2025-05-07T20:32:17.9830943Z if contiguous: 2025-05-07T20:32:17.9831033Z x0 = x0.contiguous() 2025-05-07T20:32:17.9831121Z x1 = x1.contiguous() 2025-05-07T20:32:17.9831195Z 2025-05-07T20:32:17.9831290Z if scale_ub is not None: 2025-05-07T20:32:17.9831397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9831540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9831616Z ) 2025-05-07T20:32:17.9831698Z else: 2025-05-07T20:32:17.9831795Z scale_ub_tensor = None 2025-05-07T20:32:17.9831870Z 2025-05-07T20:32:17.9832004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9832092Z op = silu_mul_quant 2025-05-07T20:32:17.9832176Z if compiled: 2025-05-07T20:32:17.9832280Z op = torch.compile(op) 2025-05-07T20:32:17.9832385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9832457Z 2025-05-07T20:32:17.9832550Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9832554Z 2025-05-07T20:32:17.9832650Z moe/activation_test.py:117: 2025-05-07T20:32:17.9832781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9832984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9833087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9833462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9833555Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9834049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9834146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9834501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9834728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9835064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9835158Z kernel = self.compile( 2025-05-07T20:32:17.9835556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9835733Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9835857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9835905Z 2025-05-07T20:32:17.9836114Z self = 2025-05-07T20:32:17.9836882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9837386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9696c10>} 2025-05-07T20:32:17.9838130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9838363Z context = 2025-05-07T20:32:17.9838368Z 2025-05-07T20:32:17.9838530Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9838793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9838904Z module_map=module_map) 2025-05-07T20:32:17.9839067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9839164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9839244Z E ^ 2025-05-07T20:32:17.9839596Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9839601Z 2025-05-07T20:32:17.9840026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9840032Z 2025-05-07T20:32:17.9840574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9840802Z self=, 2025-05-07T20:32:17.9840881Z T=128, 2025-05-07T20:32:17.9840962Z D=7168, 2025-05-07T20:32:17.9841049Z scale_ub=1200.0, 2025-05-07T20:32:17.9841133Z contiguous=False, 2025-05-07T20:32:17.9841215Z compiled=True, 2025-05-07T20:32:17.9841289Z ) 2025-05-07T20:32:17.9841504Z self = 2025-05-07T20:32:17.9841674Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9841679Z 2025-05-07T20:32:17.9841760Z @given( 2025-05-07T20:32:17.9841876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9841973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9842294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9842420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9842539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9842614Z ) 2025-05-07T20:32:17.9842860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9842965Z def test_silu_mul_quant( 2025-05-07T20:32:17.9843038Z self, 2025-05-07T20:32:17.9843112Z T: int, 2025-05-07T20:32:17.9843193Z D: int, 2025-05-07T20:32:17.9843291Z scale_ub: Optional[float], 2025-05-07T20:32:17.9843380Z contiguous: bool, 2025-05-07T20:32:17.9843470Z compiled: bool, 2025-05-07T20:32:17.9843547Z ) -> None: 2025-05-07T20:32:17.9843640Z torch.manual_seed(2025) 2025-05-07T20:32:17.9843717Z 2025-05-07T20:32:17.9843883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9843962Z 2025-05-07T20:32:17.9844054Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9844189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9844279Z x = x_sign * x_clamp 2025-05-07T20:32:17.9844360Z x0 = x[:, :D] 2025-05-07T20:32:17.9844438Z x1 = x[:, D:] 2025-05-07T20:32:17.9844514Z 2025-05-07T20:32:17.9844596Z if contiguous: 2025-05-07T20:32:17.9844748Z x0 = x0.contiguous() 2025-05-07T20:32:17.9844846Z x1 = x1.contiguous() 2025-05-07T20:32:17.9844919Z 2025-05-07T20:32:17.9845008Z if scale_ub is not None: 2025-05-07T20:32:17.9845118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9845253Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9845328Z ) 2025-05-07T20:32:17.9845409Z else: 2025-05-07T20:32:17.9845503Z scale_ub_tensor = None 2025-05-07T20:32:17.9845578Z 2025-05-07T20:32:17.9845708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9845864Z op = silu_mul_quant 2025-05-07T20:32:17.9845953Z if compiled: 2025-05-07T20:32:17.9846052Z op = torch.compile(op) 2025-05-07T20:32:17.9846155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9846229Z 2025-05-07T20:32:17.9846319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9846326Z 2025-05-07T20:32:17.9846424Z moe/activation_test.py:117: 2025-05-07T20:32:17.9846553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9846652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9846755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9847125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9847219Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9847714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9847815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9848168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9848394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9848738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9848837Z kernel = self.compile( 2025-05-07T20:32:17.9849214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9849390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9849519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9849523Z 2025-05-07T20:32:17.9849729Z self = 2025-05-07T20:32:17.9850576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9851091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98a9820>} 2025-05-07T20:32:17.9851842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9852035Z context = 2025-05-07T20:32:17.9852039Z 2025-05-07T20:32:17.9852202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9852466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9852582Z module_map=module_map) 2025-05-07T20:32:17.9852741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9852844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9852921Z E ^ 2025-05-07T20:32:17.9853272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9853323Z 2025-05-07T20:32:17.9853740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9853744Z 2025-05-07T20:32:17.9853847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9854074Z self=, 2025-05-07T20:32:17.9854149Z T=2048, 2025-05-07T20:32:17.9854222Z D=7168, 2025-05-07T20:32:17.9854308Z scale_ub=None, 2025-05-07T20:32:17.9854391Z contiguous=True, 2025-05-07T20:32:17.9854521Z compiled=True, 2025-05-07T20:32:17.9854598Z ) 2025-05-07T20:32:17.9854816Z self = 2025-05-07T20:32:17.9854987Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9854992Z 2025-05-07T20:32:17.9855068Z @given( 2025-05-07T20:32:17.9855187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9855292Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9855404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9855519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9855637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9855709Z ) 2025-05-07T20:32:17.9855957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9856048Z def test_silu_mul_quant( 2025-05-07T20:32:17.9856124Z self, 2025-05-07T20:32:17.9856205Z T: int, 2025-05-07T20:32:17.9856287Z D: int, 2025-05-07T20:32:17.9856384Z scale_ub: Optional[float], 2025-05-07T20:32:17.9856475Z contiguous: bool, 2025-05-07T20:32:17.9856559Z compiled: bool, 2025-05-07T20:32:17.9856636Z ) -> None: 2025-05-07T20:32:17.9856734Z torch.manual_seed(2025) 2025-05-07T20:32:17.9856808Z 2025-05-07T20:32:17.9856977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9857053Z 2025-05-07T20:32:17.9857142Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9857269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9857360Z x = x_sign * x_clamp 2025-05-07T20:32:17.9857443Z x0 = x[:, :D] 2025-05-07T20:32:17.9857525Z x1 = x[:, D:] 2025-05-07T20:32:17.9857595Z 2025-05-07T20:32:17.9857676Z if contiguous: 2025-05-07T20:32:17.9857768Z x0 = x0.contiguous() 2025-05-07T20:32:17.9857855Z x1 = x1.contiguous() 2025-05-07T20:32:17.9857931Z 2025-05-07T20:32:17.9858105Z if scale_ub is not None: 2025-05-07T20:32:17.9858211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9858347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9858427Z ) 2025-05-07T20:32:17.9858502Z else: 2025-05-07T20:32:17.9858598Z scale_ub_tensor = None 2025-05-07T20:32:17.9858674Z 2025-05-07T20:32:17.9858802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9858894Z op = silu_mul_quant 2025-05-07T20:32:17.9858976Z if compiled: 2025-05-07T20:32:17.9859075Z op = torch.compile(op) 2025-05-07T20:32:17.9859180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9859250Z 2025-05-07T20:32:17.9859339Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9859343Z 2025-05-07T20:32:17.9859442Z moe/activation_test.py:117: 2025-05-07T20:32:17.9859573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9859674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9859776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9863750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9863936Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9864451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9864551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9864915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9865140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9865475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9865575Z kernel = self.compile( 2025-05-07T20:32:17.9866085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9866279Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9866418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9866431Z 2025-05-07T20:32:17.9866665Z self = 2025-05-07T20:32:17.9867645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9868273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97b54c0>} 2025-05-07T20:32:17.9869207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9869432Z context = 2025-05-07T20:32:17.9869436Z 2025-05-07T20:32:17.9869620Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9869930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9870046Z module_map=module_map) 2025-05-07T20:32:17.9870222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9870328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9870404Z E ^ 2025-05-07T20:32:17.9870831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9870835Z 2025-05-07T20:32:17.9871445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9871451Z 2025-05-07T20:32:17.9871561Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9871783Z self=, 2025-05-07T20:32:17.9871867Z T=16384, 2025-05-07T20:32:17.9871941Z D=5120, 2025-05-07T20:32:17.9872027Z scale_ub=None, 2025-05-07T20:32:17.9872114Z contiguous=False, 2025-05-07T20:32:17.9872196Z compiled=False, 2025-05-07T20:32:17.9872270Z ) 2025-05-07T20:32:17.9872487Z self = 2025-05-07T20:32:17.9872666Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9872671Z 2025-05-07T20:32:17.9872749Z @given( 2025-05-07T20:32:17.9872866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9872963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9873093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9873207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9873327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9873398Z ) 2025-05-07T20:32:17.9873643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9873779Z def test_silu_mul_quant( 2025-05-07T20:32:17.9873854Z self, 2025-05-07T20:32:17.9873930Z T: int, 2025-05-07T20:32:17.9874007Z D: int, 2025-05-07T20:32:17.9874106Z scale_ub: Optional[float], 2025-05-07T20:32:17.9874192Z contiguous: bool, 2025-05-07T20:32:17.9874281Z compiled: bool, 2025-05-07T20:32:17.9874361Z ) -> None: 2025-05-07T20:32:17.9874454Z torch.manual_seed(2025) 2025-05-07T20:32:17.9874532Z 2025-05-07T20:32:17.9874701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9874818Z 2025-05-07T20:32:17.9874912Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9875038Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9876851Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9876860Z 2025-05-07T20:32:17.9876978Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9876983Z 2025-05-07T20:32:17.9877089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9877312Z self=, 2025-05-07T20:32:17.9877393Z T=4096, 2025-05-07T20:32:17.9877471Z D=7168, 2025-05-07T20:32:17.9877551Z scale_ub=1200.0, 2025-05-07T20:32:17.9877634Z contiguous=True, 2025-05-07T20:32:17.9877719Z compiled=True, 2025-05-07T20:32:17.9877790Z ) 2025-05-07T20:32:17.9878010Z self = 2025-05-07T20:32:17.9878180Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9878185Z 2025-05-07T20:32:17.9878259Z @given( 2025-05-07T20:32:17.9878380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9878476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9878589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9878705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9878814Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9878885Z ) 2025-05-07T20:32:17.9879217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9879310Z def test_silu_mul_quant( 2025-05-07T20:32:17.9879387Z self, 2025-05-07T20:32:17.9879461Z T: int, 2025-05-07T20:32:17.9879537Z D: int, 2025-05-07T20:32:17.9879635Z scale_ub: Optional[float], 2025-05-07T20:32:17.9879724Z contiguous: bool, 2025-05-07T20:32:17.9879808Z compiled: bool, 2025-05-07T20:32:17.9879888Z ) -> None: 2025-05-07T20:32:17.9879980Z torch.manual_seed(2025) 2025-05-07T20:32:17.9880054Z 2025-05-07T20:32:17.9880225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9880297Z 2025-05-07T20:32:17.9880385Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9880511Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9882271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9882326Z 2025-05-07T20:32:17.9882445Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9882449Z 2025-05-07T20:32:17.9882549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9882774Z self=, 2025-05-07T20:32:17.9882850Z T=16384, 2025-05-07T20:32:17.9882926Z D=7168, 2025-05-07T20:32:17.9883012Z scale_ub=None, 2025-05-07T20:32:17.9883096Z contiguous=False, 2025-05-07T20:32:17.9883178Z compiled=False, 2025-05-07T20:32:17.9883294Z ) 2025-05-07T20:32:17.9883513Z self = 2025-05-07T20:32:17.9883691Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9883701Z 2025-05-07T20:32:17.9883778Z @given( 2025-05-07T20:32:17.9883892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9883994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9884104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9884220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9884331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9884402Z ) 2025-05-07T20:32:17.9884643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9884741Z def test_silu_mul_quant( 2025-05-07T20:32:17.9884816Z self, 2025-05-07T20:32:17.9884894Z T: int, 2025-05-07T20:32:17.9884968Z D: int, 2025-05-07T20:32:17.9885071Z scale_ub: Optional[float], 2025-05-07T20:32:17.9885161Z contiguous: bool, 2025-05-07T20:32:17.9885247Z compiled: bool, 2025-05-07T20:32:17.9885323Z ) -> None: 2025-05-07T20:32:17.9885420Z torch.manual_seed(2025) 2025-05-07T20:32:17.9885490Z 2025-05-07T20:32:17.9885661Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9887424Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9887509Z 2025-05-07T20:32:17.9887627Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9887631Z 2025-05-07T20:32:17.9887734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9887952Z self=, 2025-05-07T20:32:17.9888032Z T=2048, 2025-05-07T20:32:17.9888107Z D=7168, 2025-05-07T20:32:17.9888190Z scale_ub=1200.0, 2025-05-07T20:32:17.9888278Z contiguous=True, 2025-05-07T20:32:17.9888358Z compiled=True, 2025-05-07T20:32:17.9888431Z ) 2025-05-07T20:32:17.9888653Z self = 2025-05-07T20:32:17.9888823Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9888827Z 2025-05-07T20:32:17.9888902Z @given( 2025-05-07T20:32:17.9889022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9889120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9889242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9889361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9889470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9889545Z ) 2025-05-07T20:32:17.9892215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9892382Z def test_silu_mul_quant( 2025-05-07T20:32:17.9892459Z self, 2025-05-07T20:32:17.9892540Z T: int, 2025-05-07T20:32:17.9892617Z D: int, 2025-05-07T20:32:17.9892722Z scale_ub: Optional[float], 2025-05-07T20:32:17.9892814Z contiguous: bool, 2025-05-07T20:32:17.9892901Z compiled: bool, 2025-05-07T20:32:17.9893009Z ) -> None: 2025-05-07T20:32:17.9893112Z torch.manual_seed(2025) 2025-05-07T20:32:17.9893205Z 2025-05-07T20:32:17.9893379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9893451Z 2025-05-07T20:32:17.9893591Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9893718Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9895490Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9895498Z 2025-05-07T20:32:17.9895617Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9895621Z 2025-05-07T20:32:17.9895727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9895947Z self=, 2025-05-07T20:32:17.9896028Z T=2048, 2025-05-07T20:32:17.9896107Z D=7168, 2025-05-07T20:32:17.9896188Z scale_ub=None, 2025-05-07T20:32:17.9896269Z contiguous=True, 2025-05-07T20:32:17.9896358Z compiled=False, 2025-05-07T20:32:17.9896429Z ) 2025-05-07T20:32:17.9896645Z self = 2025-05-07T20:32:17.9896827Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9896832Z 2025-05-07T20:32:17.9896906Z @given( 2025-05-07T20:32:17.9897022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9897120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9897231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9897349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9897459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9897531Z ) 2025-05-07T20:32:17.9897823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9897919Z def test_silu_mul_quant( 2025-05-07T20:32:17.9897994Z self, 2025-05-07T20:32:17.9898072Z T: int, 2025-05-07T20:32:17.9898145Z D: int, 2025-05-07T20:32:17.9898240Z scale_ub: Optional[float], 2025-05-07T20:32:17.9898336Z contiguous: bool, 2025-05-07T20:32:17.9898420Z compiled: bool, 2025-05-07T20:32:17.9898499Z ) -> None: 2025-05-07T20:32:17.9898593Z torch.manual_seed(2025) 2025-05-07T20:32:17.9898666Z 2025-05-07T20:32:17.9898834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9898905Z 2025-05-07T20:32:17.9898995Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.9900746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9900894Z 2025-05-07T20:32:17.9901017Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.9901022Z 2025-05-07T20:32:17.9901205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9901427Z self=, 2025-05-07T20:32:17.9901506Z T=1, 2025-05-07T20:32:17.9901587Z D=7168, 2025-05-07T20:32:17.9901670Z scale_ub=1200.0, 2025-05-07T20:32:17.9901755Z contiguous=True, 2025-05-07T20:32:17.9901842Z compiled=False, 2025-05-07T20:32:17.9901914Z ) 2025-05-07T20:32:17.9902131Z self = 2025-05-07T20:32:17.9902348Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9902353Z 2025-05-07T20:32:17.9902427Z @given( 2025-05-07T20:32:17.9902546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9902642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9902758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9902877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9902989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9903066Z ) 2025-05-07T20:32:17.9903312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9903404Z def test_silu_mul_quant( 2025-05-07T20:32:17.9903483Z self, 2025-05-07T20:32:17.9903557Z T: int, 2025-05-07T20:32:17.9903632Z D: int, 2025-05-07T20:32:17.9903730Z scale_ub: Optional[float], 2025-05-07T20:32:17.9903817Z contiguous: bool, 2025-05-07T20:32:17.9903906Z compiled: bool, 2025-05-07T20:32:17.9903985Z ) -> None: 2025-05-07T20:32:17.9904080Z torch.manual_seed(2025) 2025-05-07T20:32:17.9904151Z 2025-05-07T20:32:17.9904319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9904394Z 2025-05-07T20:32:17.9904491Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9904614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9904700Z x = x_sign * x_clamp 2025-05-07T20:32:17.9904782Z x0 = x[:, :D] 2025-05-07T20:32:17.9904860Z x1 = x[:, D:] 2025-05-07T20:32:17.9904931Z 2025-05-07T20:32:17.9905017Z if contiguous: 2025-05-07T20:32:17.9905107Z x0 = x0.contiguous() 2025-05-07T20:32:17.9905198Z x1 = x1.contiguous() 2025-05-07T20:32:17.9905269Z 2025-05-07T20:32:17.9905357Z if scale_ub is not None: 2025-05-07T20:32:17.9905464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9905646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9905721Z ) 2025-05-07T20:32:17.9905800Z else: 2025-05-07T20:32:17.9905892Z scale_ub_tensor = None 2025-05-07T20:32:17.9905963Z 2025-05-07T20:32:17.9906095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9906188Z op = silu_mul_quant 2025-05-07T20:32:17.9906273Z if compiled: 2025-05-07T20:32:17.9906377Z op = torch.compile(op) 2025-05-07T20:32:17.9906479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9906556Z 2025-05-07T20:32:17.9906646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9906650Z 2025-05-07T20:32:17.9906745Z moe/activation_test.py:117: 2025-05-07T20:32:17.9906874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9906973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9907074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9907582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9907675Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9908094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9908361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9908700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9908796Z kernel = self.compile( 2025-05-07T20:32:17.9909176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9909350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9909477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9909527Z 2025-05-07T20:32:17.9909734Z self = 2025-05-07T20:32:17.9910514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9911016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec040>} 2025-05-07T20:32:17.9911758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9911947Z context = 2025-05-07T20:32:17.9911952Z 2025-05-07T20:32:17.9912119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9912392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9912499Z module_map=module_map) 2025-05-07T20:32:17.9912659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9912766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9912841Z E ^ 2025-05-07T20:32:17.9913196Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9913201Z 2025-05-07T20:32:17.9913611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9913616Z 2025-05-07T20:32:17.9913716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9913939Z self=, 2025-05-07T20:32:17.9914020Z T=128, 2025-05-07T20:32:17.9914141Z D=5120, 2025-05-07T20:32:17.9914221Z scale_ub=None, 2025-05-07T20:32:17.9914304Z contiguous=True, 2025-05-07T20:32:17.9914392Z compiled=False, 2025-05-07T20:32:17.9914465Z ) 2025-05-07T20:32:17.9914681Z self = 2025-05-07T20:32:17.9914856Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9914861Z 2025-05-07T20:32:17.9914936Z @given( 2025-05-07T20:32:17.9915053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9915151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9915263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9915381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9915492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9915563Z ) 2025-05-07T20:32:17.9915813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9915909Z def test_silu_mul_quant( 2025-05-07T20:32:17.9915983Z self, 2025-05-07T20:32:17.9916062Z T: int, 2025-05-07T20:32:17.9916136Z D: int, 2025-05-07T20:32:17.9916233Z scale_ub: Optional[float], 2025-05-07T20:32:17.9916323Z contiguous: bool, 2025-05-07T20:32:17.9916492Z compiled: bool, 2025-05-07T20:32:17.9916570Z ) -> None: 2025-05-07T20:32:17.9916669Z torch.manual_seed(2025) 2025-05-07T20:32:17.9916743Z 2025-05-07T20:32:17.9916910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9916985Z 2025-05-07T20:32:17.9917074Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9917198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9917284Z x = x_sign * x_clamp 2025-05-07T20:32:17.9917362Z x0 = x[:, :D] 2025-05-07T20:32:17.9917446Z x1 = x[:, D:] 2025-05-07T20:32:17.9917518Z 2025-05-07T20:32:17.9917650Z if contiguous: 2025-05-07T20:32:17.9917745Z x0 = x0.contiguous() 2025-05-07T20:32:17.9917833Z x1 = x1.contiguous() 2025-05-07T20:32:17.9917904Z 2025-05-07T20:32:17.9917999Z if scale_ub is not None: 2025-05-07T20:32:17.9918104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9918242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9918322Z ) 2025-05-07T20:32:17.9918398Z else: 2025-05-07T20:32:17.9918496Z scale_ub_tensor = None 2025-05-07T20:32:17.9918567Z 2025-05-07T20:32:17.9918695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9918785Z op = silu_mul_quant 2025-05-07T20:32:17.9918869Z if compiled: 2025-05-07T20:32:17.9918969Z op = torch.compile(op) 2025-05-07T20:32:17.9919075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9919145Z 2025-05-07T20:32:17.9919234Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9919243Z 2025-05-07T20:32:17.9919346Z moe/activation_test.py:117: 2025-05-07T20:32:17.9919474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9919575Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9919673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9920176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9920277Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9920639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9920859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9921199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9921290Z kernel = self.compile( 2025-05-07T20:32:17.9921723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9921898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9922022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9922031Z 2025-05-07T20:32:17.9922239Z self = 2025-05-07T20:32:17.9923045Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9923560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec9d0>} 2025-05-07T20:32:17.9924300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9924493Z context = 2025-05-07T20:32:17.9924500Z 2025-05-07T20:32:17.9924710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9925007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9925116Z module_map=module_map) 2025-05-07T20:32:17.9925277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9925373Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9925453Z E ^ 2025-05-07T20:32:17.9925806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9925811Z 2025-05-07T20:32:17.9926228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9926272Z 2025-05-07T20:32:17.9926379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9926601Z self=, 2025-05-07T20:32:17.9926683Z T=128, 2025-05-07T20:32:17.9926760Z D=7168, 2025-05-07T20:32:17.9926841Z scale_ub=None, 2025-05-07T20:32:17.9926928Z contiguous=True, 2025-05-07T20:32:17.9927012Z compiled=False, 2025-05-07T20:32:17.9927083Z ) 2025-05-07T20:32:17.9927303Z self = 2025-05-07T20:32:17.9927470Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9927474Z 2025-05-07T20:32:17.9927554Z @given( 2025-05-07T20:32:17.9927670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9927766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9927888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9928005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9928119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9928198Z ) 2025-05-07T20:32:17.9928441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9928538Z def test_silu_mul_quant( 2025-05-07T20:32:17.9928615Z self, 2025-05-07T20:32:17.9928696Z T: int, 2025-05-07T20:32:17.9928770Z D: int, 2025-05-07T20:32:17.9928868Z scale_ub: Optional[float], 2025-05-07T20:32:17.9928955Z contiguous: bool, 2025-05-07T20:32:17.9929041Z compiled: bool, 2025-05-07T20:32:17.9929121Z ) -> None: 2025-05-07T20:32:17.9929214Z torch.manual_seed(2025) 2025-05-07T20:32:17.9929284Z 2025-05-07T20:32:17.9929454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9929525Z 2025-05-07T20:32:17.9929665Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9929791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9929878Z x = x_sign * x_clamp 2025-05-07T20:32:17.9929961Z x0 = x[:, :D] 2025-05-07T20:32:17.9930038Z x1 = x[:, D:] 2025-05-07T20:32:17.9930108Z 2025-05-07T20:32:17.9930196Z if contiguous: 2025-05-07T20:32:17.9930291Z x0 = x0.contiguous() 2025-05-07T20:32:17.9930377Z x1 = x1.contiguous() 2025-05-07T20:32:17.9930452Z 2025-05-07T20:32:17.9930542Z if scale_ub is not None: 2025-05-07T20:32:17.9930645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9930783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9930857Z ) 2025-05-07T20:32:17.9930938Z else: 2025-05-07T20:32:17.9931029Z scale_ub_tensor = None 2025-05-07T20:32:17.9931103Z 2025-05-07T20:32:17.9931232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9931325Z op = silu_mul_quant 2025-05-07T20:32:17.9931410Z if compiled: 2025-05-07T20:32:17.9931511Z op = torch.compile(op) 2025-05-07T20:32:17.9931614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9931686Z 2025-05-07T20:32:17.9931780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9931890Z 2025-05-07T20:32:17.9931989Z moe/activation_test.py:117: 2025-05-07T20:32:17.9932114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9932216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9932314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9932812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9932905Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9933265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9933534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9933873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9933973Z kernel = self.compile( 2025-05-07T20:32:17.9934363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9934540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9934665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9934670Z 2025-05-07T20:32:17.9934870Z self = 2025-05-07T20:32:17.9935642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9936153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e94f1430>} 2025-05-07T20:32:17.9936891Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9937092Z context = 2025-05-07T20:32:17.9937097Z 2025-05-07T20:32:17.9937259Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9937522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9937627Z module_map=module_map) 2025-05-07T20:32:17.9937787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9937929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9938006Z E ^ 2025-05-07T20:32:17.9938356Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9938361Z 2025-05-07T20:32:17.9938778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9938785Z 2025-05-07T20:32:17.9938887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9939109Z self=, 2025-05-07T20:32:17.9939185Z T=2048, 2025-05-07T20:32:17.9939259Z D=7168, 2025-05-07T20:32:17.9939343Z scale_ub=1200.0, 2025-05-07T20:32:17.9939427Z contiguous=True, 2025-05-07T20:32:17.9939508Z compiled=False, 2025-05-07T20:32:17.9939584Z ) 2025-05-07T20:32:17.9939799Z self = 2025-05-07T20:32:17.9939990Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9939996Z 2025-05-07T20:32:17.9940381Z @given( 2025-05-07T20:32:17.9940544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9940649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9940913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9941032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9941197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9941270Z ) 2025-05-07T20:32:17.9941514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9941611Z def test_silu_mul_quant( 2025-05-07T20:32:17.9941687Z self, 2025-05-07T20:32:17.9941763Z T: int, 2025-05-07T20:32:17.9941837Z D: int, 2025-05-07T20:32:17.9941933Z scale_ub: Optional[float], 2025-05-07T20:32:17.9942022Z contiguous: bool, 2025-05-07T20:32:17.9942179Z compiled: bool, 2025-05-07T20:32:17.9942256Z ) -> None: 2025-05-07T20:32:17.9942350Z torch.manual_seed(2025) 2025-05-07T20:32:17.9942420Z 2025-05-07T20:32:17.9942587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9944389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9944398Z 2025-05-07T20:32:17.9944517Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9944522Z 2025-05-07T20:32:17.9944628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9944848Z self=, 2025-05-07T20:32:17.9944930Z T=1, 2025-05-07T20:32:17.9945006Z D=5120, 2025-05-07T20:32:17.9945089Z scale_ub=1200.0, 2025-05-07T20:32:17.9945177Z contiguous=True, 2025-05-07T20:32:17.9945268Z compiled=False, 2025-05-07T20:32:17.9945341Z ) 2025-05-07T20:32:17.9945556Z self = 2025-05-07T20:32:17.9945719Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9945723Z 2025-05-07T20:32:17.9945797Z @given( 2025-05-07T20:32:17.9945916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9946014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9946124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9946244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9946422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9946501Z ) 2025-05-07T20:32:17.9946744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9946835Z def test_silu_mul_quant( 2025-05-07T20:32:17.9946915Z self, 2025-05-07T20:32:17.9946995Z T: int, 2025-05-07T20:32:17.9947070Z D: int, 2025-05-07T20:32:17.9947170Z scale_ub: Optional[float], 2025-05-07T20:32:17.9947255Z contiguous: bool, 2025-05-07T20:32:17.9947338Z compiled: bool, 2025-05-07T20:32:17.9947416Z ) -> None: 2025-05-07T20:32:17.9947509Z torch.manual_seed(2025) 2025-05-07T20:32:17.9947579Z 2025-05-07T20:32:17.9947749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9947821Z 2025-05-07T20:32:17.9947915Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9948036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9948131Z x = x_sign * x_clamp 2025-05-07T20:32:17.9948213Z x0 = x[:, :D] 2025-05-07T20:32:17.9948291Z x1 = x[:, D:] 2025-05-07T20:32:17.9948360Z 2025-05-07T20:32:17.9948445Z if contiguous: 2025-05-07T20:32:17.9948535Z x0 = x0.contiguous() 2025-05-07T20:32:17.9948624Z x1 = x1.contiguous() 2025-05-07T20:32:17.9948781Z 2025-05-07T20:32:17.9948870Z if scale_ub is not None: 2025-05-07T20:32:17.9948973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9949112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9949187Z ) 2025-05-07T20:32:17.9949267Z else: 2025-05-07T20:32:17.9949358Z scale_ub_tensor = None 2025-05-07T20:32:17.9949428Z 2025-05-07T20:32:17.9949562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9949650Z op = silu_mul_quant 2025-05-07T20:32:17.9949733Z if compiled: 2025-05-07T20:32:17.9949880Z op = torch.compile(op) 2025-05-07T20:32:17.9949984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9950055Z 2025-05-07T20:32:17.9950153Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9950157Z 2025-05-07T20:32:17.9950253Z moe/activation_test.py:117: 2025-05-07T20:32:17.9950384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9950487Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9950588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9951092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9951188Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9951543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9951768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9952116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9952212Z kernel = self.compile( 2025-05-07T20:32:17.9952588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9952767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9952893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9952898Z 2025-05-07T20:32:17.9953101Z self = 2025-05-07T20:32:17.9953873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9954412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9412160>} 2025-05-07T20:32:17.9955167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9955362Z context = 2025-05-07T20:32:17.9955366Z 2025-05-07T20:32:17.9955529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9955791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9955897Z module_map=module_map) 2025-05-07T20:32:17.9956054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9956152Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9956227Z E ^ 2025-05-07T20:32:17.9956584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9956591Z 2025-05-07T20:32:17.9957007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9957011Z 2025-05-07T20:32:17.9957197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9957421Z self=, 2025-05-07T20:32:17.9957496Z T=2048, 2025-05-07T20:32:17.9957570Z D=5120, 2025-05-07T20:32:17.9957656Z scale_ub=None, 2025-05-07T20:32:17.9957740Z contiguous=True, 2025-05-07T20:32:17.9957822Z compiled=False, 2025-05-07T20:32:17.9957898Z ) 2025-05-07T20:32:17.9958110Z self = 2025-05-07T20:32:17.9958285Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9958289Z 2025-05-07T20:32:17.9958406Z @given( 2025-05-07T20:32:17.9958525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9958627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9958739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9958855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9958976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9959047Z ) 2025-05-07T20:32:17.9959289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9959383Z def test_silu_mul_quant( 2025-05-07T20:32:17.9959457Z self, 2025-05-07T20:32:17.9959535Z T: int, 2025-05-07T20:32:17.9959609Z D: int, 2025-05-07T20:32:17.9959707Z scale_ub: Optional[float], 2025-05-07T20:32:17.9959799Z contiguous: bool, 2025-05-07T20:32:17.9959884Z compiled: bool, 2025-05-07T20:32:17.9959961Z ) -> None: 2025-05-07T20:32:17.9960057Z torch.manual_seed(2025) 2025-05-07T20:32:17.9960133Z 2025-05-07T20:32:17.9960298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9960373Z 2025-05-07T20:32:17.9960463Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.9962261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9962270Z 2025-05-07T20:32:17.9962388Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.9962392Z 2025-05-07T20:32:17.9962499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9962800Z self=, 2025-05-07T20:32:17.9962892Z T=16384, 2025-05-07T20:32:17.9962980Z D=5120, 2025-05-07T20:32:17.9963074Z scale_ub=None, 2025-05-07T20:32:17.9963156Z contiguous=True, 2025-05-07T20:32:17.9963245Z compiled=False, 2025-05-07T20:32:17.9963317Z ) 2025-05-07T20:32:17.9963531Z self = 2025-05-07T20:32:17.9963707Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9963711Z 2025-05-07T20:32:17.9963785Z @given( 2025-05-07T20:32:17.9963899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9963999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9964110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9964226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9964342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9964416Z ) 2025-05-07T20:32:17.9964662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9964755Z def test_silu_mul_quant( 2025-05-07T20:32:17.9964829Z self, 2025-05-07T20:32:17.9964907Z T: int, 2025-05-07T20:32:17.9965065Z D: int, 2025-05-07T20:32:17.9965164Z scale_ub: Optional[float], 2025-05-07T20:32:17.9965253Z contiguous: bool, 2025-05-07T20:32:17.9965337Z compiled: bool, 2025-05-07T20:32:17.9965415Z ) -> None: 2025-05-07T20:32:17.9965507Z torch.manual_seed(2025) 2025-05-07T20:32:17.9965577Z 2025-05-07T20:32:17.9965747Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9967544Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
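Every OutOfMemoryError above carries the same remedy in its final sentences: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce caching-allocator fragmentation. A minimal sketch of applying that advice, assuming the variable takes effect before the first CUDA allocation (in CI it would normally be exported in the job environment rather than from Python):

    import os
    # Must be set before the first CUDA allocation; harmless if already set.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def report_cuda_memory(device: int = 0) -> None:
        # Prints the same quantities the OOM message reports.
        free, total = torch.cuda.mem_get_info(device)
        print(
            f"free={free / 2**20:.2f} MiB, total={total / 2**30:.2f} GiB, "
            f"allocated={torch.cuda.memory_allocated(device) / 2**30:.2f} GiB, "
            f"reserved={torch.cuda.memory_reserved(device) / 2**20:.2f} MiB"
        )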
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9967591Z 2025-05-07T20:32:17.9967714Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9967718Z 2025-05-07T20:32:17.9967818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9968038Z self=, 2025-05-07T20:32:17.9968116Z T=4096, 2025-05-07T20:32:17.9968190Z D=5120, 2025-05-07T20:32:17.9968271Z scale_ub=None, 2025-05-07T20:32:17.9968356Z contiguous=True, 2025-05-07T20:32:17.9968439Z compiled=False, 2025-05-07T20:32:17.9968513Z ) 2025-05-07T20:32:17.9968730Z self = 2025-05-07T20:32:17.9968907Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9968912Z 2025-05-07T20:32:17.9968990Z @given( 2025-05-07T20:32:17.9969103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9969201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9969321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9969439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9969551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9969625Z ) 2025-05-07T20:32:17.9969867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9969961Z def test_silu_mul_quant( 2025-05-07T20:32:17.9970037Z self, 2025-05-07T20:32:17.9970113Z T: int, 2025-05-07T20:32:17.9970189Z D: int, 2025-05-07T20:32:17.9970285Z scale_ub: Optional[float], 2025-05-07T20:32:17.9970371Z contiguous: bool, 2025-05-07T20:32:17.9970509Z compiled: bool, 2025-05-07T20:32:17.9970586Z ) -> None: 2025-05-07T20:32:17.9970678Z torch.manual_seed(2025) 2025-05-07T20:32:17.9970757Z 2025-05-07T20:32:17.9970921Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9972699Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
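The "Tried to allocate" sizes for the large examples are exactly the footprint of x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16): T * 2D elements at 2 bytes each. (The 20.00 MiB requests that appear later for T=128 do not follow this formula, so they presumably reflect other allocations or allocator rounding.) A quick check of the arithmetic:

    def bf16_mib(T: int, D: int) -> float:
        # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(2048, 5120) == 40.0    # "40.00 MiB"
    assert bf16_mib(4096, 5120) == 80.0    # "80.00 MiB"
    assert bf16_mib(4096, 7168) == 112.0   # "112.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # "320.00 MiB"
    assert bf16_mib(16384, 7168) == 448.0  # "448.00 MiB"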
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9972705Z 2025-05-07T20:32:17.9972820Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9972829Z 2025-05-07T20:32:17.9972931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9973149Z self=, 2025-05-07T20:32:17.9973227Z T=2048, 2025-05-07T20:32:17.9973304Z D=5120, 2025-05-07T20:32:17.9973387Z scale_ub=None, 2025-05-07T20:32:17.9973555Z contiguous=False, 2025-05-07T20:32:17.9973642Z compiled=False, 2025-05-07T20:32:17.9973713Z ) 2025-05-07T20:32:17.9973927Z self = 2025-05-07T20:32:17.9974102Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9974106Z 2025-05-07T20:32:17.9974181Z @given( 2025-05-07T20:32:17.9974294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9974394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9974505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9974624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9974777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9974849Z ) 2025-05-07T20:32:17.9975093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9975184Z def test_silu_mul_quant( 2025-05-07T20:32:17.9975261Z self, 2025-05-07T20:32:17.9975340Z T: int, 2025-05-07T20:32:17.9975415Z D: int, 2025-05-07T20:32:17.9975511Z scale_ub: Optional[float], 2025-05-07T20:32:17.9975600Z contiguous: bool, 2025-05-07T20:32:17.9975684Z compiled: bool, 2025-05-07T20:32:17.9975771Z ) -> None: 2025-05-07T20:32:17.9975864Z torch.manual_seed(2025) 2025-05-07T20:32:17.9975934Z 2025-05-07T20:32:17.9976101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9977881Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9977891Z 2025-05-07T20:32:17.9978009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9978014Z 2025-05-07T20:32:17.9978113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9978332Z self=, 2025-05-07T20:32:17.9978410Z T=4096, 2025-05-07T20:32:17.9978485Z D=7168, 2025-05-07T20:32:17.9978565Z scale_ub=None, 2025-05-07T20:32:17.9978653Z contiguous=True, 2025-05-07T20:32:17.9978734Z compiled=True, 2025-05-07T20:32:17.9978804Z ) 2025-05-07T20:32:17.9979066Z self = 2025-05-07T20:32:17.9979235Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9979239Z 2025-05-07T20:32:17.9979318Z @given( 2025-05-07T20:32:17.9979432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9979534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9979651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9979766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9979876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9979953Z ) 2025-05-07T20:32:17.9980396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9980492Z def test_silu_mul_quant( 2025-05-07T20:32:17.9980566Z self, 2025-05-07T20:32:17.9980641Z T: int, 2025-05-07T20:32:17.9980718Z D: int, 2025-05-07T20:32:17.9980821Z scale_ub: Optional[float], 2025-05-07T20:32:17.9980908Z contiguous: bool, 2025-05-07T20:32:17.9981002Z compiled: bool, 2025-05-07T20:32:17.9981119Z ) -> None: 2025-05-07T20:32:17.9981215Z torch.manual_seed(2025) 2025-05-07T20:32:17.9981289Z 2025-05-07T20:32:17.9981509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9983301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
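The (T, D, scale_ub, contiguous, compiled) combinations in these "Trying example" blocks come straight from the @given strategies in the listing: each argument is drawn with st.sampled_from, so the search space is the 5 x 2 x 2 x 2 x 2 cross product, from which Hypothesis draws up to _MAX_SAMPLES examples (the session header later in the log shows the 'ci' profile runs with derandomize=True, which is why the same combinations recur across retries). A self-contained toy version of the same pattern:

    from hypothesis import given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(max_examples=6, deadline=None)
    def test_toy(T: int, contiguous: bool) -> None:
        # Hypothesis calls this once per drawn (T, contiguous) combination.
        assert T >= 1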
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9983345Z 2025-05-07T20:32:17.9983466Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9983471Z 2025-05-07T20:32:17.9983574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9983792Z self=, 2025-05-07T20:32:17.9983867Z T=2048, 2025-05-07T20:32:17.9983946Z D=5120, 2025-05-07T20:32:17.9984030Z scale_ub=1200.0, 2025-05-07T20:32:17.9984114Z contiguous=False, 2025-05-07T20:32:17.9984199Z compiled=False, 2025-05-07T20:32:17.9984269Z ) 2025-05-07T20:32:17.9984483Z self = 2025-05-07T20:32:17.9984657Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9984662Z 2025-05-07T20:32:17.9984736Z @given( 2025-05-07T20:32:17.9984850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9984949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9985063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9985183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9985292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9985364Z ) 2025-05-07T20:32:17.9985610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9985707Z def test_silu_mul_quant( 2025-05-07T20:32:17.9985781Z self, 2025-05-07T20:32:17.9985859Z T: int, 2025-05-07T20:32:17.9985934Z D: int, 2025-05-07T20:32:17.9986028Z scale_ub: Optional[float], 2025-05-07T20:32:17.9986116Z contiguous: bool, 2025-05-07T20:32:17.9986200Z compiled: bool, 2025-05-07T20:32:17.9986279Z ) -> None: 2025-05-07T20:32:17.9986371Z torch.manual_seed(2025) 2025-05-07T20:32:17.9986444Z 2025-05-07T20:32:17.9986611Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9988395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9988406Z 2025-05-07T20:32:17.9988527Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9988531Z 2025-05-07T20:32:17.9988633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9988852Z self=, 2025-05-07T20:32:17.9988929Z T=4096, 2025-05-07T20:32:17.9989003Z D=7168, 2025-05-07T20:32:17.9989085Z scale_ub=1200.0, 2025-05-07T20:32:17.9989170Z contiguous=True, 2025-05-07T20:32:17.9989260Z compiled=False, 2025-05-07T20:32:17.9989332Z ) 2025-05-07T20:32:17.9989547Z self = 2025-05-07T20:32:17.9989716Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9989720Z 2025-05-07T20:32:17.9989839Z @given( 2025-05-07T20:32:17.9989993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9990090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9990210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9994040Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9994172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9994247Z ) 2025-05-07T20:32:17.9994497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9994588Z def test_silu_mul_quant( 2025-05-07T20:32:17.9994663Z self, 2025-05-07T20:32:17.9994833Z T: int, 2025-05-07T20:32:17.9994912Z D: int, 2025-05-07T20:32:17.9995008Z scale_ub: Optional[float], 2025-05-07T20:32:17.9995096Z contiguous: bool, 2025-05-07T20:32:17.9995182Z compiled: bool, 2025-05-07T20:32:17.9995261Z ) -> None: 2025-05-07T20:32:17.9995359Z torch.manual_seed(2025) 2025-05-07T20:32:17.9995435Z 2025-05-07T20:32:17.9995607Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9997373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9997381Z 2025-05-07T20:32:17.9997502Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9997507Z 2025-05-07T20:32:17.9997608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9997828Z self=, 2025-05-07T20:32:17.9997915Z T=16384, 2025-05-07T20:32:17.9997991Z D=7168, 2025-05-07T20:32:17.9998073Z scale_ub=None, 2025-05-07T20:32:17.9998162Z contiguous=False, 2025-05-07T20:32:17.9998245Z compiled=True, 2025-05-07T20:32:17.9998319Z ) 2025-05-07T20:32:17.9998538Z self = 2025-05-07T20:32:17.9998714Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9998718Z 2025-05-07T20:32:17.9998796Z @given( 2025-05-07T20:32:17.9998911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9999054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9999177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9999289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9999398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9999474Z ) 2025-05-07T20:32:17.9999719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9999812Z def test_silu_mul_quant( 2025-05-07T20:32:17.9999887Z self, 2025-05-07T20:32:17.9999961Z T: int, 2025-05-07T20:32:18.0000038Z D: int, 2025-05-07T20:32:18.0000135Z scale_ub: Optional[float], 2025-05-07T20:32:18.0000220Z contiguous: bool, 2025-05-07T20:32:18.0000310Z compiled: bool, 2025-05-07T20:32:18.0000385Z ) -> None: 2025-05-07T20:32:18.0000477Z torch.manual_seed(2025) 2025-05-07T20:32:18.0000551Z 2025-05-07T20:32:18.0000716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0002561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
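Note that once the device is nearly exhausted (26.44 MiB free of 22.07 GiB), every later example fails at its very first allocation, so most of the OOMs above are cascading symptoms of one filled-up GPU rather than independent failures. One speculative mitigation, sketched here only for illustration and not something activation_test.py currently does, is to release cached blocks between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.synchronize()   # let pending kernels finish
        torch.cuda.empty_cache()   # return cached blocks to the driver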
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0002602Z 2025-05-07T20:32:18.0002725Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0002729Z 2025-05-07T20:32:18.0002834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0003053Z self=, 2025-05-07T20:32:18.0003128Z T=4096, 2025-05-07T20:32:18.0003208Z D=7168, 2025-05-07T20:32:18.0003329Z scale_ub=None, 2025-05-07T20:32:18.0003415Z contiguous=True, 2025-05-07T20:32:18.0003501Z compiled=False, 2025-05-07T20:32:18.0003572Z ) 2025-05-07T20:32:18.0003786Z self = 2025-05-07T20:32:18.0003961Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.0003971Z 2025-05-07T20:32:18.0004045Z @given( 2025-05-07T20:32:18.0004159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0004263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0004376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0004493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0004604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0004677Z ) 2025-05-07T20:32:18.0004922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0005014Z def test_silu_mul_quant( 2025-05-07T20:32:18.0005093Z self, 2025-05-07T20:32:18.0005172Z T: int, 2025-05-07T20:32:18.0005246Z D: int, 2025-05-07T20:32:18.0005340Z scale_ub: Optional[float], 2025-05-07T20:32:18.0005432Z contiguous: bool, 2025-05-07T20:32:18.0005515Z compiled: bool, 2025-05-07T20:32:18.0005596Z ) -> None: 2025-05-07T20:32:18.0005691Z torch.manual_seed(2025) 2025-05-07T20:32:18.0005762Z 2025-05-07T20:32:18.0005929Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0007719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0007730Z 2025-05-07T20:32:18.0007849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0007854Z 2025-05-07T20:32:18.0007954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0008177Z self=, 2025-05-07T20:32:18.0008256Z T=16384, 2025-05-07T20:32:18.0008329Z D=7168, 2025-05-07T20:32:18.0008407Z scale_ub=None, 2025-05-07T20:32:18.0008495Z contiguous=True, 2025-05-07T20:32:18.0008580Z compiled=False, 2025-05-07T20:32:18.0008652Z ) 2025-05-07T20:32:18.0008867Z self = 2025-05-07T20:32:18.0009038Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.0009042Z 2025-05-07T20:32:18.0009121Z @given( 2025-05-07T20:32:18.0009242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0009342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0009457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0009575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0009684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0009844Z ) 2025-05-07T20:32:18.0010089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0010184Z def test_silu_mul_quant( 2025-05-07T20:32:18.0010258Z self, 2025-05-07T20:32:18.0010332Z T: int, 2025-05-07T20:32:18.0010411Z D: int, 2025-05-07T20:32:18.0010509Z scale_ub: Optional[float], 2025-05-07T20:32:18.0010596Z contiguous: bool, 2025-05-07T20:32:18.0010681Z compiled: bool, 2025-05-07T20:32:18.0010758Z ) -> None: 2025-05-07T20:32:18.0010854Z torch.manual_seed(2025) 2025-05-07T20:32:18.0010927Z 2025-05-07T20:32:18.0011136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0012889Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0012899Z 2025-05-07T20:32:18.0013015Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0013019Z 2025-05-07T20:32:18.0013124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0013340Z self=, 2025-05-07T20:32:18.0013418Z T=16384, 2025-05-07T20:32:18.0013498Z D=7168, 2025-05-07T20:32:18.0013578Z scale_ub=1200.0, 2025-05-07T20:32:18.0013661Z contiguous=True, 2025-05-07T20:32:18.0013749Z compiled=False, 2025-05-07T20:32:18.0013822Z ) 2025-05-07T20:32:18.0014035Z self = 2025-05-07T20:32:18.0014220Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.0014225Z 2025-05-07T20:32:18.0014303Z @given( 2025-05-07T20:32:18.0014419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0014518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0014628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0014746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0014856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0014928Z ) 2025-05-07T20:32:18.0015218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0015318Z def test_silu_mul_quant( 2025-05-07T20:32:18.0015391Z self, 2025-05-07T20:32:18.0015473Z T: int, 2025-05-07T20:32:18.0015547Z D: int, 2025-05-07T20:32:18.0015642Z scale_ub: Optional[float], 2025-05-07T20:32:18.0015732Z contiguous: bool, 2025-05-07T20:32:18.0015822Z compiled: bool, 2025-05-07T20:32:18.0015901Z ) -> None: 2025-05-07T20:32:18.0015998Z torch.manual_seed(2025) 2025-05-07T20:32:18.0016069Z 2025-05-07T20:32:18.0016234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0017980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0017987Z 2025-05-07T20:32:18.0018104Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0018150Z 2025-05-07T20:32:18.0018290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0018509Z self=, 2025-05-07T20:32:18.0018587Z T=128, 2025-05-07T20:32:18.0018663Z D=5120, 2025-05-07T20:32:18.0018744Z scale_ub=1200.0, 2025-05-07T20:32:18.0018832Z contiguous=False, 2025-05-07T20:32:18.0018912Z compiled=False, 2025-05-07T20:32:18.0018981Z ) 2025-05-07T20:32:18.0019200Z self = 2025-05-07T20:32:18.0019372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.0019416Z 2025-05-07T20:32:18.0019501Z @given( 2025-05-07T20:32:18.0019615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0019711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0019824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0019937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0020056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0020131Z ) 2025-05-07T20:32:18.0020374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0020469Z def test_silu_mul_quant( 2025-05-07T20:32:18.0020542Z self, 2025-05-07T20:32:18.0020615Z T: int, 2025-05-07T20:32:18.0020693Z D: int, 2025-05-07T20:32:18.0020787Z scale_ub: Optional[float], 2025-05-07T20:32:18.0020874Z contiguous: bool, 2025-05-07T20:32:18.0020960Z compiled: bool, 2025-05-07T20:32:18.0021036Z ) -> None: 2025-05-07T20:32:18.0021196Z torch.manual_seed(2025) 2025-05-07T20:32:18.0021276Z 2025-05-07T20:32:18.0021440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0021512Z 2025-05-07T20:32:18.0021604Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0021727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0021823Z x = x_sign * x_clamp 2025-05-07T20:32:18.0021903Z x0 = x[:, :D] 2025-05-07T20:32:18.0021980Z x1 = x[:, D:] 2025-05-07T20:32:18.0022052Z 2025-05-07T20:32:18.0022133Z if contiguous: 2025-05-07T20:32:18.0022221Z x0 = x0.contiguous() 2025-05-07T20:32:18.0022310Z x1 = x1.contiguous() 2025-05-07T20:32:18.0022382Z 2025-05-07T20:32:18.0022470Z if scale_ub is not None: 2025-05-07T20:32:18.0022578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.0022714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.0022790Z ) 2025-05-07T20:32:18.0022941Z else: 2025-05-07T20:32:18.0023053Z scale_ub_tensor = None 2025-05-07T20:32:18.0023131Z 2025-05-07T20:32:18.0023261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.0023350Z op = silu_mul_quant 2025-05-07T20:32:18.0023439Z if compiled: 2025-05-07T20:32:18.0023544Z op = torch.compile(op) 2025-05-07T20:32:18.0023646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0023722Z 2025-05-07T20:32:18.0023811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.0023816Z 2025-05-07T20:32:18.0023911Z moe/activation_test.py:117: 2025-05-07T20:32:18.0024040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0024141Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.0024238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0024750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.0024847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.0025211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.0025436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.0025883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.0025981Z kernel = self.compile( 2025-05-07T20:32:18.0026366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.0026546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.0026671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0026676Z 2025-05-07T20:32:18.0026878Z self = 2025-05-07T20:32:18.0027703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.0028204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9301ca0>} 2025-05-07T20:32:18.0028957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.0029147Z context = 2025-05-07T20:32:18.0029151Z 2025-05-07T20:32:18.0029318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.0029583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.0029691Z module_map=module_map) 2025-05-07T20:32:18.0029857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.0029954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.0030028Z E ^ 2025-05-07T20:32:18.0030394Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.0030399Z 2025-05-07T20:32:18.0030820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.0030825Z 2025-05-07T20:32:18.0030928Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0031150Z self=, 2025-05-07T20:32:18.0031224Z T=2048, 2025-05-07T20:32:18.0031298Z D=7168, 2025-05-07T20:32:18.0031379Z scale_ub=None, 2025-05-07T20:32:18.0031510Z contiguous=False, 2025-05-07T20:32:18.0031596Z compiled=False, 2025-05-07T20:32:18.0031670Z ) 2025-05-07T20:32:18.0031885Z self = 2025-05-07T20:32:18.0032056Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.0032068Z 2025-05-07T20:32:18.0032147Z @given( 2025-05-07T20:32:18.0032263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0032363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0032474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0032589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0032703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0032775Z ) 2025-05-07T20:32:18.0033018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0033115Z def test_silu_mul_quant( 2025-05-07T20:32:18.0033189Z self, 2025-05-07T20:32:18.0033269Z T: int, 2025-05-07T20:32:18.0033346Z D: int, 2025-05-07T20:32:18.0033441Z scale_ub: Optional[float], 2025-05-07T20:32:18.0033532Z contiguous: bool, 2025-05-07T20:32:18.0033615Z compiled: bool, 2025-05-07T20:32:18.0033691Z ) -> None: 2025-05-07T20:32:18.0033882Z torch.manual_seed(2025) 2025-05-07T20:32:18.0033960Z 2025-05-07T20:32:18.0034126Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0035881Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
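The recurring CompilationError is an architecture limitation rather than a kernel-launch bug: this runner's GPU reports 22.07 GiB total, consistent with an NVIDIA A10G (compute capability 8.6), and the Triton build here only exposes fp8e4nv on newer parts, which is why the error lists ('fp8e4b15', 'fp8e5') as the only supported fp8 dtypes. A hedged sketch of a capability guard a test could use to skip these paths; supports_fp8e4nv is a hypothetical helper, not an FBGEMM or Triton API, and the (8, 9) cutoff is an assumption based on fp8e4nv being available on Ada (sm_89) and Hopper (sm_90):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Decorator for test methods that exercise fp8e4nv kernels.
    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv not supported on this GPU"
    )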
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0035924Z 2025-05-07T20:32:18.0036043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0036047Z 2025-05-07T20:32:18.0036150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0036370Z self=, 2025-05-07T20:32:18.0036448Z T=128, 2025-05-07T20:32:18.0036528Z D=7168, 2025-05-07T20:32:18.0036608Z scale_ub=1200.0, 2025-05-07T20:32:18.0036697Z contiguous=True, 2025-05-07T20:32:18.0036782Z compiled=True, 2025-05-07T20:32:18.0036859Z ) 2025-05-07T20:32:18.0037078Z self = 2025-05-07T20:32:18.0037244Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.0037248Z 2025-05-07T20:32:18.0037321Z @given( 2025-05-07T20:32:18.0037441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0037543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0037654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0037773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0037884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0037957Z ) 2025-05-07T20:32:18.0038204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0038296Z def test_silu_mul_quant( 2025-05-07T20:32:18.0038373Z self, 2025-05-07T20:32:18.0038447Z T: int, 2025-05-07T20:32:18.0038520Z D: int, 2025-05-07T20:32:18.0038619Z scale_ub: Optional[float], 2025-05-07T20:32:18.0038705Z contiguous: bool, 2025-05-07T20:32:18.0038787Z compiled: bool, 2025-05-07T20:32:18.0038867Z ) -> None: 2025-05-07T20:32:18.0038959Z torch.manual_seed(2025) 2025-05-07T20:32:18.0039030Z 2025-05-07T20:32:18.0039236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0039315Z 2025-05-07T20:32:18.0039407Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0039529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0039615Z x = x_sign * x_clamp 2025-05-07T20:32:18.0039697Z x0 = x[:, :D] 2025-05-07T20:32:18.0039777Z x1 = x[:, D:] 2025-05-07T20:32:18.0039850Z 2025-05-07T20:32:18.0039935Z if contiguous: 2025-05-07T20:32:18.0040023Z x0 = x0.contiguous() 2025-05-07T20:32:18.0040464Z x1 = x1.contiguous() 2025-05-07T20:32:18.0040574Z 2025-05-07T20:32:18.0040700Z if scale_ub is not None: 2025-05-07T20:32:18.0040847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.0041068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.0041215Z ) 2025-05-07T20:32:18.0041309Z else: 2025-05-07T20:32:18.0041405Z scale_ub_tensor = None 2025-05-07T20:32:18.0041478Z 2025-05-07T20:32:18.0041620Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.0041707Z op = silu_mul_quant 2025-05-07T20:32:18.0041793Z if compiled: 2025-05-07T20:32:18.0041893Z op = torch.compile(op) 2025-05-07T20:32:18.0041999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0042221Z 2025-05-07T20:32:18.0042319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.0042324Z 2025-05-07T20:32:18.0042421Z moe/activation_test.py:117: 2025-05-07T20:32:18.0042550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0042655Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.0042752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0043184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:18.0043275Z return fn(*args, **kwargs) 2025-05-07T20:32:18.0043767Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.0043927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.0044283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.0044513Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.0044861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.0044954Z kernel = self.compile( 2025-05-07T20:32:18.0045334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.0045507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.0045631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0045636Z 2025-05-07T20:32:18.0045850Z self = 2025-05-07T20:32:18.0046619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.0047134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e95ae280>} 2025-05-07T20:32:18.0047881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.0048080Z context = 2025-05-07T20:32:18.0048084Z 2025-05-07T20:32:18.0048250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.0048578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.0048691Z module_map=module_map) 2025-05-07T20:32:18.0048855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.0048952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.0049036Z E ^ 2025-05-07T20:32:18.0049391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.0049396Z 2025-05-07T20:32:18.0049805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.0049809Z 2025-05-07T20:32:18.0049908Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0050127Z self=, 2025-05-07T20:32:18.0050205Z T=128, 2025-05-07T20:32:18.0050280Z D=7168, 2025-05-07T20:32:18.0050368Z scale_ub=1200.0, 2025-05-07T20:32:18.0050454Z contiguous=True, 2025-05-07T20:32:18.0050535Z compiled=False, 2025-05-07T20:32:18.0050607Z ) 2025-05-07T20:32:18.0050827Z self = 2025-05-07T20:32:18.0051033Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.0051075Z 2025-05-07T20:32:18.0051154Z @given( 2025-05-07T20:32:18.0051270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0051369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0051490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0051603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0051714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0051788Z ) 2025-05-07T20:32:18.0052032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0052135Z def test_silu_mul_quant( 2025-05-07T20:32:18.0052251Z self, 2025-05-07T20:32:18.0052326Z T: int, 2025-05-07T20:32:18.0052405Z D: int, 2025-05-07T20:32:18.0052500Z scale_ub: Optional[float], 2025-05-07T20:32:18.0052590Z contiguous: bool, 2025-05-07T20:32:18.0052678Z compiled: bool, 2025-05-07T20:32:18.0052764Z ) -> None: 2025-05-07T20:32:18.0052867Z torch.manual_seed(2025) 2025-05-07T20:32:18.0052957Z 2025-05-07T20:32:18.0053149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0053220Z 2025-05-07T20:32:18.0053311Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0053433Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0055220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
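For orientation amid the repetition: the test compares silu_mul_quant against a reference that computes y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it with triton_quantize_fp8_row, dequantizing as y_fp8.to(torch.float32) * y_scale[:, None]. An eager-PyTorch sketch of that row-wise quantization step, assuming a conventional per-row scale with e4m3 max value 448; this is illustrative only, not the FBGEMM implementation:

    import torch

    FP8_MAX = 448.0  # assumed max representable magnitude of float8_e4m3fn

    def quantize_fp8_row_eager(y: torch.Tensor, scale_ub=None):
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the row maxima
        y_scale = row_max / FP8_MAX                     # one scale per row
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return y_fp8.to(torch.float8_e4m3fn), y_scale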
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0055233Z 2025-05-07T20:32:18.0055353Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:18.0055358Z 2025-05-07T20:32:18.0055462Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0055682Z self=, 2025-05-07T20:32:18.0055757Z T=128, 2025-05-07T20:32:18.0055833Z D=5120, 2025-05-07T20:32:18.0055921Z scale_ub=1200.0, 2025-05-07T20:32:18.0055999Z contiguous=True, 2025-05-07T20:32:18.0056082Z compiled=True, 2025-05-07T20:32:18.0056155Z ) 2025-05-07T20:32:18.0056366Z self = 2025-05-07T20:32:18.0056580Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.0056585Z 2025-05-07T20:32:18.0056660Z @given( 2025-05-07T20:32:18.0056775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0056875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0056992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0057109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0057223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0057294Z ) 2025-05-07T20:32:18.0057541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0057633Z def test_silu_mul_quant( 2025-05-07T20:32:18.0057707Z self, 2025-05-07T20:32:18.0057783Z T: int, 2025-05-07T20:32:18.0057856Z D: int, 2025-05-07T20:32:18.0057952Z scale_ub: Optional[float], 2025-05-07T20:32:18.0058045Z contiguous: bool, 2025-05-07T20:32:18.0058134Z compiled: bool, 2025-05-07T20:32:18.0058211Z ) -> None: 2025-05-07T20:32:18.0058306Z torch.manual_seed(2025) 2025-05-07T20:32:18.0058378Z 2025-05-07T20:32:18.0058540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0058614Z 2025-05-07T20:32:18.0058771Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0058932Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0060710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0060764Z 2025-05-07T20:32:18.0060886Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:18.0060890Z 2025-05-07T20:32:18.0060991Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0061284Z self=, 2025-05-07T20:32:18.0061368Z T=128, 2025-05-07T20:32:18.0061443Z D=7168, 2025-05-07T20:32:18.0061522Z scale_ub=None, 2025-05-07T20:32:18.0061607Z contiguous=True, 2025-05-07T20:32:18.0061688Z compiled=True, 2025-05-07T20:32:18.0061758Z ) 2025-05-07T20:32:18.0061972Z self = 2025-05-07T20:32:18.0062134Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:18.0062138Z 2025-05-07T20:32:18.0062214Z @given( 2025-05-07T20:32:18.0062330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0062426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0062553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0062668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0062780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0062854Z ) 2025-05-07T20:32:18.0063100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0063194Z def test_silu_mul_quant( 2025-05-07T20:32:18.0063276Z self, 2025-05-07T20:32:18.0063351Z T: int, 2025-05-07T20:32:18.0063429Z D: int, 2025-05-07T20:32:18.0063524Z scale_ub: Optional[float], 2025-05-07T20:32:18.0063609Z contiguous: bool, 2025-05-07T20:32:18.0063696Z compiled: bool, 2025-05-07T20:32:18.0063772Z ) -> None: 2025-05-07T20:32:18.0063865Z torch.manual_seed(2025) 2025-05-07T20:32:18.0063939Z 2025-05-07T20:32:18.0064104Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0065891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0065903Z 2025-05-07T20:32:18.0066020Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0066152Z =============================== warnings summary =============================== 2025-05-07T20:32:18.0066461Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0066762Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0067058Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0067960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:18.0068230Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:18.0068239Z 2025-05-07T20:32:18.0068448Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:18.0068615Z ================= 1 failed, 1 deselected, 3 warnings in 24.05s ================= 2025-05-07T20:32:19.6289721Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:19.6931904Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:19.6932164Z 2025-05-07T20:32:21.6948931Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:23.8532853Z ============================= test session starts ============================== 2025-05-07T20:32:23.8533485Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:23.8534024Z cachedir: .pytest_cache 2025-05-07T20:32:23.8534607Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:23.8535385Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:23.8535801Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.4755442Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.6872946Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:25.6873372Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:25.6873593Z 2025-05-07T20:32:28.4026568Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.4027301Z self=, 2025-05-07T20:32:28.4027732Z T=1, 2025-05-07T20:32:28.4027935Z D=5120, 2025-05-07T20:32:28.4028138Z scale_ub=None, 2025-05-07T20:32:28.4028372Z contiguous=True, 2025-05-07T20:32:28.4028611Z compiled=True, 2025-05-07T20:32:28.4028830Z ) 2025-05-07T20:32:28.4029168Z self = 2025-05-07T20:32:28.4029672Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.4029942Z 2025-05-07T20:32:28.4030041Z @given( 2025-05-07T20:32:28.4030591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.4030928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.4031246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.4031582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.4031933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.4032232Z ) 2025-05-07T20:32:28.4032585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.4033041Z def test_silu_mul_quant( 2025-05-07T20:32:28.4033293Z self, 2025-05-07T20:32:28.4033502Z T: int, 2025-05-07T20:32:28.4033705Z D: int, 2025-05-07T20:32:28.4033933Z scale_ub: Optional[float], 2025-05-07T20:32:28.4034218Z contiguous: bool, 2025-05-07T20:32:28.4034464Z compiled: bool, 2025-05-07T20:32:28.4034702Z ) -> None: 2025-05-07T20:32:28.4034933Z torch.manual_seed(2025) 2025-05-07T20:32:28.4035188Z 2025-05-07T20:32:28.4035472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.4035829Z 2025-05-07T20:32:28.4036029Z x_sign = torch.sign(x) 2025-05-07T20:32:28.4036332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:28.4036808Z x = x_sign * x_clamp 2025-05-07T20:32:28.4037059Z x0 = x[:, :D] 2025-05-07T20:32:28.4037282Z x1 = x[:, D:] 2025-05-07T20:32:28.4037497Z 2025-05-07T20:32:28.4037688Z if contiguous: 2025-05-07T20:32:28.4037934Z x0 = x0.contiguous() 2025-05-07T20:32:28.4038201Z x1 = x1.contiguous() 2025-05-07T20:32:28.4038446Z 2025-05-07T20:32:28.4038649Z if scale_ub is not None: 2025-05-07T20:32:28.4038933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.4039284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.4039597Z ) 2025-05-07T20:32:28.4039902Z else: 2025-05-07T20:32:28.4040363Z scale_ub_tensor = None 2025-05-07T20:32:28.4040624Z 2025-05-07T20:32:28.4040871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4041203Z op = silu_mul_quant 2025-05-07T20:32:28.4041462Z if compiled: 2025-05-07T20:32:28.4041729Z op = torch.compile(op) 2025-05-07T20:32:28.4042038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4042318Z 2025-05-07T20:32:28.4042522Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.4042820Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.4043116Z 2025-05-07T20:32:28.4043366Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4043716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.4044021Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.4044341Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.4044713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.4045033Z 2025-05-07T20:32:28.4045238Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.4045442Z 2025-05-07T20:32:28.4045546Z moe/activation_test.py:126: 2025-05-07T20:32:28.4045854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4046245Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.4046592Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.4047394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.4048158Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.4048710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.4049401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.4050179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.4050923Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.4051692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.4052459Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.4053191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.4053828Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.4054445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.4054969Z fn() 2025-05-07T20:32:28.4055485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.4056067Z self.fn.run( 
2025-05-07T20:32:28.4056542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.4057082Z kernel = self.compile( 2025-05-07T20:32:28.4057695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.4058414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.4058826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4059063Z 2025-05-07T20:32:28.4059286Z self = 2025-05-07T20:32:28.4060370Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.4061931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdcfd99d0>} 2025-05-07T20:32:28.4063282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.4064308Z context = 2025-05-07T20:32:28.4064603Z 2025-05-07T20:32:28.4064780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.4065307Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.4065779Z module_map=module_map) 2025-05-07T20:32:28.4073060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.4073466Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.4073752Z E ^ 2025-05-07T20:32:28.4074242Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.4074694Z 2025-05-07T20:32:28.4075137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.4075664Z 2025-05-07T20:32:28.4075772Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.4076208Z self=, 2025-05-07T20:32:28.4076620Z T=2048, 2025-05-07T20:32:28.4076813Z D=5120, 2025-05-07T20:32:28.4077020Z scale_ub=1200.0, 2025-05-07T20:32:28.4077256Z contiguous=True, 2025-05-07T20:32:28.4077485Z compiled=False, 2025-05-07T20:32:28.4077711Z ) 2025-05-07T20:32:29.9271538Z self = 2025-05-07T20:32:29.9272567Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.9272869Z 2025-05-07T20:32:29.9272966Z @given( 2025-05-07T20:32:29.9273209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.9273544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.9273881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.9274243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.9274593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.9274896Z ) 2025-05-07T20:32:29.9275267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.9275724Z def test_silu_mul_quant( 2025-05-07T20:32:29.9275986Z self, 2025-05-07T20:32:29.9276199Z T: int, 2025-05-07T20:32:29.9276406Z D: int, 2025-05-07T20:32:29.9276640Z scale_ub: Optional[float], 2025-05-07T20:32:29.9276929Z contiguous: bool, 2025-05-07T20:32:29.9277177Z compiled: bool, 2025-05-07T20:32:29.9277430Z ) -> None: 2025-05-07T20:32:29.9277667Z torch.manual_seed(2025) 2025-05-07T20:32:29.9277918Z 2025-05-07T20:32:29.9278209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.9278573Z 
2025-05-07T20:32:29.9278773Z x_sign = torch.sign(x) 2025-05-07T20:32:29.9279233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.9279566Z x = x_sign * x_clamp 2025-05-07T20:32:29.9279833Z x0 = x[:, :D] 2025-05-07T20:32:29.9280059Z x1 = x[:, D:] 2025-05-07T20:32:29.9280290Z 2025-05-07T20:32:29.9280490Z if contiguous: 2025-05-07T20:32:29.9280734Z x0 = x0.contiguous() 2025-05-07T20:32:29.9281012Z x1 = x1.contiguous() 2025-05-07T20:32:29.9281269Z 2025-05-07T20:32:29.9281469Z if scale_ub is not None: 2025-05-07T20:32:29.9281758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.9282113Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.9282516Z ) 2025-05-07T20:32:29.9282784Z else: 2025-05-07T20:32:29.9283099Z scale_ub_tensor = None 2025-05-07T20:32:29.9283457Z 2025-05-07T20:32:29.9283715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9284050Z op = silu_mul_quant 2025-05-07T20:32:29.9284311Z if compiled: 2025-05-07T20:32:29.9284581Z op = torch.compile(op) 2025-05-07T20:32:29.9284894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9285175Z 2025-05-07T20:32:29.9285381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.9285560Z 2025-05-07T20:32:29.9285667Z moe/activation_test.py:117: 2025-05-07T20:32:29.9285980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9286316Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.9286611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9287323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.9288022Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.9288578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.9289290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.9289974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.9290518Z kernel = self.compile( 2025-05-07T20:32:29.9291078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.9291748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9292153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9292404Z 2025-05-07T20:32:29.9292687Z self = 2025-05-07T20:32:29.9293801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.9295211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
2025-05-07T20:32:29.9296580Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:29.9297619Z context = <...>
2025-05-07T20:32:29.9297924Z 
2025-05-07T20:32:29.9298100Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.9298653Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.9299136Z                            module_map=module_map)
2025-05-07T20:32:29.9299517Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.9299978Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:29.9300256Z E       ^
2025-05-07T20:32:29.9300723Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.9301350Z 
2025-05-07T20:32:29.9301782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
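Both failure modes abort at the same point: src.make_ir raises while lowering the kernel AST, because Triton's NVIDIA backend lowers the fp8e4nv (e4m3) type only on GPUs of compute capability 8.9 or newer; on older parts it exposes only 'fp8e5' and 'fp8e4b15', exactly as the ValueError reports. The failure does not depend on FBGEMM at all. A minimal sketch (hypothetical kernel name; standard Triton and PyTorch APIs only) that should reproduce the same CompilationError on a pre-SM89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
        # Casting to fp8e4nv forces the unsupported dtype into the IR, so
        # compilation fails before the kernel ever launches.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    # On an sm_86-class GPU this raises CompilationError:
    # "type fp8e4nv not supported in this architecture ..."
    _cast_to_fp8e4nv[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)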
2025-05-07T20:32:29.9302437Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:29.9302874Z     self=<...>,
2025-05-07T20:32:29.9303283Z     T=2048,
2025-05-07T20:32:29.9303537Z     D=5120,
2025-05-07T20:32:29.9303750Z     scale_ub=1200.0,
2025-05-07T20:32:29.9304058Z     contiguous=True,
2025-05-07T20:32:29.9304334Z     compiled=True,
2025-05-07T20:32:29.9304560Z )
2025-05-07T20:32:29.9304884Z self = <...>
2025-05-07T20:32:29.9305400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:29.9305693Z 
[... test source identical to the example above, through the definition of fn(); elided ...]
2025-05-07T20:32:29.9317747Z         y_fp8, y_scale = fn()
2025-05-07T20:32:29.9318052Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:29.9318355Z 
2025-05-07T20:32:29.9318592Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:29.9318936Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:29.9319237Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:29.9319650Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:29.9320021Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:29.9320341Z 
2025-05-07T20:32:29.9320554Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:29.9320753Z 
2025-05-07T20:32:29.9320856Z moe/activation_test.py:126: 
2025-05-07T20:32:29.9321166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:29.9321511Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:29.9321842Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:29.9322651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:29.9323469Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:29.9324028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:29.9324718Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:29.9325413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:29.9326158Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:29.9326908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:29.9327659Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:29.9328397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:29.9329047Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:29.9329651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:29.9330188Z     fn()
2025-05-07T20:32:29.9330710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:29.9331304Z     self.fn.run(
2025-05-07T20:32:29.9331771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:29.9332310Z     kernel = self.compile(
2025-05-07T20:32:29.9332861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:29.9333512Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:29.9333966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:29.9334216Z 
2025-05-07T20:32:29.9334431Z self = <...>
2025-05-07T20:32:29.9335523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:29.9336927Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fbfdbb7ca60>}
2025-05-07T20:32:29.9338305Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:29.9339342Z context = <...>
2025-05-07T20:32:29.9339640Z 
2025-05-07T20:32:29.9339817Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.9340626Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.9341149Z                            module_map=module_map)
2025-05-07T20:32:29.9341698Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.9342064Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:29.9342332Z E       ^
2025-05-07T20:32:29.9342801Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.9343250Z 
2025-05-07T20:32:29.9343703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
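Note the asymmetry between the two dumps above: with compiled=False the failure surfaces directly in fn() when _fbgemm_silu_mul_quant is compiled, while with compiled=True fn() gets through and the reference path fails instead, inside the autotuner of triton_quantize_fp8_row (do_bench compiles _kernel_quantize_fp8_row once per pruned config). Either way, every example dies on the first fp8e4nv kernel it reaches. For reference, here is a pure-PyTorch sketch of the rowwise fp8 quantization that ref_fn delegates to; this is an illustration under stated assumptions, not FBGEMM's actual triton_quantize_fp8_row, but it is consistent with the dequantization y_fp8.to(torch.float32) * y_scale[:, None] used by the test (requires a PyTorch build with torch.float8_e4m3fn):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row: y ~= y_fp8.to(fp32) * scale[:, None].
        row_max = x.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Clamp the per-row maximum by the scale upper bound, as the
            # scale_ub_tensor argument does in the test.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        scale = row_max / FP8_MAX
        xq = (x.to(torch.float32) / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return xq.to(torch.float8_e4m3fn), scale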
[... the remaining Hypothesis examples fail identically; the repeated test source and tracebacks, matching the two shown above, are elided. compiled=False dies compiling _fbgemm_silu_mul_quant; compiled=True dies in the reference path compiling _kernel_quantize_fp8_row ...]
2025-05-07T20:32:29.9344335Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:31.2708752Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:31.2750243Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:32.9942186Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:32.9973246Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:33.0778872Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:33.4890430Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:33.4929591Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:34.1466151Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
sanitize_overflow=True)
2025-05-07T20:32:34.7592597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda8f0f70>}
2025-05-07T20:32:34.7595028Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:34.7596805Z context =
2025-05-07T20:32:34.7597284Z
2025-05-07T20:32:34.7597556Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.7598407Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.7599319Z                            module_map=module_map)
2025-05-07T20:32:34.7599905Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.7600476Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:34.7600914Z E       ^
2025-05-07T20:32:34.7601685Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.7602451Z
2025-05-07T20:32:34.7603170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[... the test body and this identical CompilationError traceback (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) repeat verbatim for each of the following examples; only the sampled parameters change ...]
2025-05-07T20:32:34.7604227Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7576448Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.5913346Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.6347964Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:36.6349252Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:36.6350623Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:36.6351620Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:36.6352760Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
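The recompile warning above is expected with this parameter sweep: each sampled (T, contiguous) combination changes the shapes and strides of x0/x1, so torch.compile specializes silu_mul_quant per combination until dynamo's recompile_limit of 8 is hit, after which it stops recompiling (the last guard failure shown is the stride change from slicing without .contiguous()). If the recompiles themselves ever need debugging, a minimal sketch, assuming silu_mul_quant is importable from the activation module named in the warning:

    import torch
    import torch._dynamo
    # Assumed import path, taken from the file named in the warning above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: allow more recompiles before dynamo gives up (debugging only).
    torch._dynamo.config.recompile_limit = 64

    # Option 2: mark the token dimension dynamic so T=1/128/2048/... share one
    # graph (this addresses shape-driven recompiles; the stride mismatch from
    # non-contiguous slices would still force a separate graph).
    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)

    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)  # scale_ub_tensor=None, as in the test

Running with TORCH_LOGS="recompiles", as the warning itself suggests, prints every guard failure rather than only the last one.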
[... the T=16384 example replays the same test body and fails with the identical ref_fn traceback ...]
2025-05-07T20:32:36.7606264Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... test body elided (identical to above); this example fails one step earlier, at y_fp8, y_scale = fn(), while compiling the forward kernel itself ...]
2025-05-07T20:32:36.9324249Z moe/activation_test.py:117:
2025-05-07T20:32:36.9324551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.9324897Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.9325187Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.9325747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.9326312Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.9326986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.9327674Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.9328217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.9328961Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.9329642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.9330171Z     kernel = self.compile(
2025-05-07T20:32:36.9330719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.9331389Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.9331790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.9337614Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.9338150Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.9338622Z                            module_map=module_map)
2025-05-07T20:32:36.9338993Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.9339435Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.9339706Z E       ^
2025-05-07T20:32:36.9340327Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.9340786Z
2025-05-07T20:32:36.9341264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[... the remaining examples fail with the same two tracebacks; the failing call site is noted per example ...]
2025-05-07T20:32:36.9341887Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn (_kernel_quantize_fp8_row)
2025-05-07T20:32:37.0212105Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.3916932Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.3949580Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.5524779Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.5556512Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant; trace truncated in this excerpt)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.7879283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.7879967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.7880640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.7881182Z kernel = self.compile( 2025-05-07T20:32:37.7881728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.7882382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.7882774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7883009Z 2025-05-07T20:32:37.7883263Z self = 2025-05-07T20:32:37.7884344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.7885743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9bf0ee0>} 2025-05-07T20:32:37.7887100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.7888120Z context = 2025-05-07T20:32:37.7888415Z 2025-05-07T20:32:37.7888582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.7889111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.7889581Z module_map=module_map) 2025-05-07T20:32:37.7889963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.7890327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.7890663Z E ^ 2025-05-07T20:32:37.7891159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.7891621Z 2025-05-07T20:32:37.7892038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.7892549Z 2025-05-07T20:32:37.7892661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.7893072Z self=, 2025-05-07T20:32:37.7893479Z T=1, 2025-05-07T20:32:37.7893668Z D=7168, 2025-05-07T20:32:37.7893868Z scale_ub=1200.0, 2025-05-07T20:32:37.7894139Z contiguous=True, 2025-05-07T20:32:37.7894367Z compiled=True, 2025-05-07T20:32:37.7894578Z ) 2025-05-07T20:32:37.7894897Z self = 2025-05-07T20:32:37.7895394Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.7895659Z 2025-05-07T20:32:37.7895744Z @given( 2025-05-07T20:32:37.7895975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.7896295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.7896611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.7896939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.7897277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.7897566Z ) 2025-05-07T20:32:37.7897927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.7898369Z def test_silu_mul_quant( 2025-05-07T20:32:37.7898615Z self, 2025-05-07T20:32:37.7898814Z T: int, 2025-05-07T20:32:37.7899007Z D: int, 2025-05-07T20:32:37.7899258Z scale_ub: Optional[float], 2025-05-07T20:32:37.7899556Z contiguous: bool, 2025-05-07T20:32:37.7899798Z compiled: bool, 2025-05-07T20:32:37.7900026Z ) -> None: 2025-05-07T20:32:37.7900254Z torch.manual_seed(2025) 2025-05-07T20:32:37.7900495Z 2025-05-07T20:32:37.7900770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.7901222Z 2025-05-07T20:32:37.7901409Z x_sign = torch.sign(x) 2025-05-07T20:32:37.7901703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.7902018Z x = x_sign * x_clamp 2025-05-07T20:32:37.7902258Z x0 = x[:, :D] 2025-05-07T20:32:37.7902481Z x1 = x[:, D:] 2025-05-07T20:32:37.7902690Z 2025-05-07T20:32:37.7902883Z if contiguous: 2025-05-07T20:32:37.7903114Z x0 = x0.contiguous() 2025-05-07T20:32:37.7903424Z x1 = x1.contiguous() 2025-05-07T20:32:37.7903667Z 2025-05-07T20:32:37.7903855Z if scale_ub is not None: 2025-05-07T20:32:37.7904130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.7904468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.7904777Z ) 2025-05-07T20:32:37.7904974Z else: 2025-05-07T20:32:37.7905188Z scale_ub_tensor = None 2025-05-07T20:32:37.7905437Z 2025-05-07T20:32:37.7905673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.7905989Z op = silu_mul_quant 2025-05-07T20:32:37.7906241Z if compiled: 2025-05-07T20:32:37.7906491Z op = torch.compile(op) 2025-05-07T20:32:37.7906788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.7907063Z 2025-05-07T20:32:37.7907261Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.7907433Z 2025-05-07T20:32:37.7907534Z moe/activation_test.py:117: 2025-05-07T20:32:37.7907843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7908177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.7908466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.7909095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.7909719Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.7910396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.7911082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.7911617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.7912299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.7912964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.7913547Z kernel = self.compile( 2025-05-07T20:32:37.7914084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.7914739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.7915144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7915375Z 2025-05-07T20:32:37.7915587Z self = 2025-05-07T20:32:37.7916657Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.7918027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdae2c940>} 2025-05-07T20:32:37.7919370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.7920385Z context = 2025-05-07T20:32:37.7920676Z 2025-05-07T20:32:37.7920844Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.7921362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.7921825Z module_map=module_map) 2025-05-07T20:32:37.7922191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.7922543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.7922804Z E ^ 2025-05-07T20:32:37.7923267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.7923760Z 2025-05-07T20:32:37.7924183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.7924705Z 2025-05-07T20:32:37.7924808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.7925226Z self=, 2025-05-07T20:32:37.7925629Z T=1, 2025-05-07T20:32:37.7925808Z D=7168, 2025-05-07T20:32:37.7926001Z scale_ub=1200.0, 2025-05-07T20:32:37.7926225Z contiguous=False, 2025-05-07T20:32:37.7926449Z compiled=True, 2025-05-07T20:32:37.7926654Z ) 2025-05-07T20:32:38.1456442Z self = 2025-05-07T20:32:38.1457011Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.1457291Z 2025-05-07T20:32:38.1457375Z @given( 2025-05-07T20:32:38.1457624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.1457964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.1458292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.1458642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.1458995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.1459422Z ) 2025-05-07T20:32:38.1459849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.1460317Z def test_silu_mul_quant( 2025-05-07T20:32:38.1460568Z self, 2025-05-07T20:32:38.1460782Z T: int, 2025-05-07T20:32:38.1460993Z D: int, 2025-05-07T20:32:38.1461317Z scale_ub: Optional[float], 2025-05-07T20:32:38.1461605Z contiguous: bool, 2025-05-07T20:32:38.1461861Z compiled: bool, 2025-05-07T20:32:38.1462091Z ) -> None: 2025-05-07T20:32:38.1462325Z torch.manual_seed(2025) 2025-05-07T20:32:38.1462579Z 2025-05-07T20:32:38.1462859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.1463286Z 2025-05-07T20:32:38.1463486Z x_sign = torch.sign(x) 2025-05-07T20:32:38.1463784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.1464096Z x = x_sign * x_clamp 2025-05-07T20:32:38.1464346Z x0 = x[:, :D] 2025-05-07T20:32:38.1464578Z x1 = x[:, D:] 2025-05-07T20:32:38.1464791Z 2025-05-07T20:32:38.1464983Z if contiguous: 2025-05-07T20:32:38.1465225Z x0 = x0.contiguous() 2025-05-07T20:32:38.1465487Z x1 = x1.contiguous() 2025-05-07T20:32:38.1465737Z 2025-05-07T20:32:38.1465941Z if scale_ub is not None: 2025-05-07T20:32:38.1466217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.1466563Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.1466877Z ) 2025-05-07T20:32:38.1467070Z else: 2025-05-07T20:32:38.1467285Z scale_ub_tensor = None 2025-05-07T20:32:38.1467549Z 2025-05-07T20:32:38.1467788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.1468111Z op = silu_mul_quant 2025-05-07T20:32:38.1468369Z if compiled: 2025-05-07T20:32:38.1468627Z op = torch.compile(op) 2025-05-07T20:32:38.1468930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.1469238Z 2025-05-07T20:32:38.1469431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.1469605Z 2025-05-07T20:32:38.1469712Z moe/activation_test.py:117: 2025-05-07T20:32:38.1470017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.1470358Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.1470646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.1471217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.1471784Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.1472518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.1473210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.1473754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.1474444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.1475115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.1475652Z kernel = self.compile( 2025-05-07T20:32:38.1476200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.1476852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.1477256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.1477497Z 2025-05-07T20:32:38.1477712Z self = 2025-05-07T20:32:38.1478804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.1480261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ba55e0>} 2025-05-07T20:32:38.1481610Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.1482630Z context = 2025-05-07T20:32:38.1482925Z 2025-05-07T20:32:38.1483104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.1483685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.1484152Z module_map=module_map) 2025-05-07T20:32:38.1484531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.1484888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.1485158Z E ^ 2025-05-07T20:32:38.1485625Z E ValueError("type fp8e4nv not supported in this architecture. 
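The failure is environment-level rather than data-dependent: every drawn example dies while compiling the kernel, before any numerics run. Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and (as of the Triton release in this environment) fp8e4nv is only emitted for NVIDIA GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a guard, assuming pytest-style collection (the helper and marker names here are hypothetical, not FBGEMM's):

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton only compiles fp8e4nv (float8_e4m3fn) for SM 8.9+ GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; applying it to test_silu_mul_quant would skip
    # instead of failing every Hypothesis example on unsupported GPUs.
    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton supports only fp8e4b15/fp8e5 on this architecture",
    )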
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets past the op under test: fn() returns, the output is dequantized, and the failure moves into the reference path, where triton_quantize_fp8_row JIT-compiles its own fp8 kernel inside the autotuner's benchmarking loop and hits the same architecture check:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported (in _fbgemm_silu_mul_quant, as above)
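Note that ref_fn is not an independent oracle on this machine: triton_quantize_fp8_row also JIT-compiles an fp8 Triton kernel, so the reference fails for the same reason as the op under test. A device-agnostic rowwise quantization could be sketched in plain PyTorch against the dequant contract the test uses (y ~= y_fp8.to(torch.float32) * y_scale[:, None]); this is an illustration only: the 448.0 bound is the finite max of float8_e4m3fn, and the scale_ub handling is inferred from the argument name rather than taken from FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise scale so that y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        # Clamp away zero rows to avoid division by zero.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (
            (y.to(torch.float32) / y_scale[:, None])
            .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale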
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.4706482Z 2025-05-07T20:32:38.4706910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.4707548Z 2025-05-07T20:32:38.4707659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.4708098Z self=, 2025-05-07T20:32:38.4708514Z T=1, 2025-05-07T20:32:38.4708707Z D=5120, 2025-05-07T20:32:38.4708921Z scale_ub=1200.0, 2025-05-07T20:32:38.4709165Z contiguous=False, 2025-05-07T20:32:38.4709404Z compiled=False, 2025-05-07T20:32:38.4709631Z ) 2025-05-07T20:32:38.4709970Z self = 2025-05-07T20:32:38.4710470Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.4710808Z 2025-05-07T20:32:38.4710894Z @given( 2025-05-07T20:32:38.4711142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.4711472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.4711791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.4712149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.4712506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.4712802Z ) 2025-05-07T20:32:38.4713171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.4713637Z def test_silu_mul_quant( 2025-05-07T20:32:38.4713891Z self, 2025-05-07T20:32:38.4714106Z T: int, 2025-05-07T20:32:38.4714326Z D: int, 2025-05-07T20:32:38.4714568Z scale_ub: Optional[float], 2025-05-07T20:32:38.4714850Z contiguous: bool, 2025-05-07T20:32:38.4715110Z compiled: bool, 2025-05-07T20:32:38.4715355Z ) -> None: 2025-05-07T20:32:38.4715592Z torch.manual_seed(2025) 2025-05-07T20:32:38.4715857Z 2025-05-07T20:32:38.4716149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.4716503Z 2025-05-07T20:32:38.4716715Z x_sign = torch.sign(x) 2025-05-07T20:32:38.4717030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.4717360Z x = x_sign * x_clamp 2025-05-07T20:32:38.4717622Z x0 = x[:, :D] 2025-05-07T20:32:38.4717856Z x1 = x[:, D:] 2025-05-07T20:32:38.4718071Z 2025-05-07T20:32:38.4718296Z if contiguous: 2025-05-07T20:32:38.4718545Z x0 = x0.contiguous() 2025-05-07T20:32:38.4718819Z x1 = x1.contiguous() 2025-05-07T20:32:38.4719070Z 2025-05-07T20:32:38.4719280Z if scale_ub is not None: 2025-05-07T20:32:38.4719574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.4719922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.4720262Z ) 2025-05-07T20:32:38.4720523Z else: 2025-05-07T20:32:38.4720746Z scale_ub_tensor = None 2025-05-07T20:32:38.4721023Z 2025-05-07T20:32:38.4721277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.4721605Z op = silu_mul_quant 2025-05-07T20:32:38.4721883Z if compiled: 2025-05-07T20:32:38.4722158Z op = torch.compile(op) 2025-05-07T20:32:38.4722466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.4722770Z 2025-05-07T20:32:38.4722984Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.4723158Z 2025-05-07T20:32:38.4723274Z moe/activation_test.py:117: 2025-05-07T20:32:38.4723580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.4723936Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.4724237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.4724950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.4725670Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.4726231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.4726975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.4727689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.4728246Z kernel = self.compile( 2025-05-07T20:32:38.4728812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.4729486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.4729909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.4730159Z 2025-05-07T20:32:38.4730379Z self = 2025-05-07T20:32:38.4731544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.4732946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9a30550>} 2025-05-07T20:32:38.4734302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.4735337Z context = 2025-05-07T20:32:38.4735643Z 2025-05-07T20:32:38.4735822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.4736365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.4736847Z module_map=module_map) 2025-05-07T20:32:38.4737236Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.4737617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.4737889Z E ^ 2025-05-07T20:32:38.4738373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.4738835Z 2025-05-07T20:32:38.4739273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.4739792Z 2025-05-07T20:32:38.4739901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.4740634Z self=, 2025-05-07T20:32:38.4741110Z T=16384, 2025-05-07T20:32:38.4741312Z D=5120, 2025-05-07T20:32:38.4741521Z scale_ub=1200.0, 2025-05-07T20:32:38.4741849Z contiguous=False, 2025-05-07T20:32:38.4742084Z compiled=True, 2025-05-07T20:32:38.4742302Z ) 2025-05-07T20:32:38.5914223Z self = 2025-05-07T20:32:38.5914765Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.5915078Z 2025-05-07T20:32:38.5915162Z @given( 2025-05-07T20:32:38.5915414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5915745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5916073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5916418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5916766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5917070Z ) 2025-05-07T20:32:38.5917429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5917887Z def test_silu_mul_quant( 2025-05-07T20:32:38.5918159Z self, 2025-05-07T20:32:38.5918363Z T: int, 2025-05-07T20:32:38.5918574Z D: int, 2025-05-07T20:32:38.5918809Z scale_ub: Optional[float], 2025-05-07T20:32:38.5919091Z contiguous: bool, 2025-05-07T20:32:38.5919347Z compiled: bool, 2025-05-07T20:32:38.5919591Z ) -> None: 2025-05-07T20:32:38.5920138Z torch.manual_seed(2025) 2025-05-07T20:32:38.5920404Z 2025-05-07T20:32:38.5920696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5921067Z 2025-05-07T20:32:38.5921276Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5921587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5921922Z x = x_sign * x_clamp 2025-05-07T20:32:38.5922173Z x0 = x[:, :D] 2025-05-07T20:32:38.5922405Z x1 = x[:, D:] 2025-05-07T20:32:38.5922631Z 2025-05-07T20:32:38.5922826Z if contiguous: 2025-05-07T20:32:38.5923075Z x0 = x0.contiguous() 2025-05-07T20:32:38.5923442Z x1 = x1.contiguous() 2025-05-07T20:32:38.5923694Z 2025-05-07T20:32:38.5923905Z if scale_ub is not None: 2025-05-07T20:32:38.5924202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5924550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5924881Z ) 2025-05-07T20:32:38.5925098Z else: 2025-05-07T20:32:38.5925319Z scale_ub_tensor = None 2025-05-07T20:32:38.5925589Z 2025-05-07T20:32:38.5925839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5926166Z op = silu_mul_quant 2025-05-07T20:32:38.5926437Z if compiled: 2025-05-07T20:32:38.5926705Z op = torch.compile(op) 2025-05-07T20:32:38.5927025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5927310Z 2025-05-07T20:32:38.5927518Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5927689Z 2025-05-07T20:32:38.5927805Z moe/activation_test.py:117: 2025-05-07T20:32:38.5928115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5928470Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5928797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5929400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5929988Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5930663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5931361Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5931913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5932596Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5933367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5933927Z kernel = self.compile( 2025-05-07T20:32:38.5934478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5935154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5935576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5935815Z 2025-05-07T20:32:38.5936039Z self = 2025-05-07T20:32:38.5937134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5938526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae41f0>} 2025-05-07T20:32:38.5939883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5941453Z context = 2025-05-07T20:32:38.5941799Z 2025-05-07T20:32:38.5941977Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5942521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5943007Z module_map=module_map) 2025-05-07T20:32:38.5943394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5943756Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5944033Z E ^ 2025-05-07T20:32:38.5944516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5945029Z 2025-05-07T20:32:38.5945448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5945974Z 2025-05-07T20:32:38.5946084Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.5946519Z self=, 2025-05-07T20:32:38.5946936Z T=2048, 2025-05-07T20:32:38.5947134Z D=7168, 2025-05-07T20:32:38.5947343Z scale_ub=1200.0, 2025-05-07T20:32:38.5947586Z contiguous=False, 2025-05-07T20:32:38.5947820Z compiled=True, 2025-05-07T20:32:38.5948041Z ) 2025-05-07T20:32:38.5948374Z self = 2025-05-07T20:32:38.5948875Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.5949167Z 2025-05-07T20:32:38.5949250Z @given( 2025-05-07T20:32:38.5949497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5949832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5950146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5950496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5950843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5951139Z ) 2025-05-07T20:32:38.5951503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5951961Z def test_silu_mul_quant( 2025-05-07T20:32:38.5952211Z self, 2025-05-07T20:32:38.5952426Z T: int, 2025-05-07T20:32:38.5952639Z D: int, 2025-05-07T20:32:38.5952869Z scale_ub: Optional[float], 2025-05-07T20:32:38.5953158Z contiguous: bool, 2025-05-07T20:32:38.5953415Z compiled: bool, 2025-05-07T20:32:38.5953646Z ) -> None: 2025-05-07T20:32:38.5953884Z torch.manual_seed(2025) 2025-05-07T20:32:38.5954148Z 2025-05-07T20:32:38.5954501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5954859Z 2025-05-07T20:32:38.5955073Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5955383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5955703Z x = x_sign * x_clamp 2025-05-07T20:32:38.5955956Z x0 = x[:, :D] 2025-05-07T20:32:38.5956193Z x1 = x[:, D:] 2025-05-07T20:32:38.5956406Z 2025-05-07T20:32:38.5956603Z if contiguous: 2025-05-07T20:32:38.5956847Z x0 = x0.contiguous() 2025-05-07T20:32:38.5957114Z x1 = x1.contiguous() 2025-05-07T20:32:38.5957369Z 2025-05-07T20:32:38.5957575Z if scale_ub is not None: 2025-05-07T20:32:38.5957858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5958208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5958532Z ) 2025-05-07T20:32:38.5958731Z else: 2025-05-07T20:32:38.5958955Z scale_ub_tensor = None 2025-05-07T20:32:38.5959228Z 2025-05-07T20:32:38.5959473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5959803Z op = silu_mul_quant 2025-05-07T20:32:38.5960069Z if compiled: 2025-05-07T20:32:38.5960331Z op = torch.compile(op) 2025-05-07T20:32:38.5960631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5961029Z 2025-05-07T20:32:38.5961237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5961410Z 2025-05-07T20:32:38.5961514Z moe/activation_test.py:117: 2025-05-07T20:32:38.5961826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5962171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5962461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5963026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5963591Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5964263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5965005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5965553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5966246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5966910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5967452Z kernel = self.compile( 2025-05-07T20:32:38.5968004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5968666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5969071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5969313Z 2025-05-07T20:32:38.5969567Z self = 2025-05-07T20:32:38.5970670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5972050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae4ee0>} 2025-05-07T20:32:38.5973397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5974416Z context = 2025-05-07T20:32:38.5974718Z 2025-05-07T20:32:38.5974890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5975483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5975955Z module_map=module_map) 2025-05-07T20:32:38.5976334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5976704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5976975Z E ^ 2025-05-07T20:32:38.5977440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5977898Z 2025-05-07T20:32:38.5978319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5978833Z 2025-05-07T20:32:38.8662913Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.8663434Z self=, 2025-05-07T20:32:38.8663854Z T=1, 2025-05-07T20:32:38.8664049Z D=5120, 2025-05-07T20:32:38.8664283Z scale_ub=None, 2025-05-07T20:32:38.8664514Z contiguous=False, 2025-05-07T20:32:38.8664746Z compiled=False, 2025-05-07T20:32:38.8664970Z ) 2025-05-07T20:32:38.8665301Z self = 2025-05-07T20:32:38.8666084Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:38.8666440Z 2025-05-07T20:32:38.8666524Z @given( 2025-05-07T20:32:38.8666764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.8667087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.8667411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.8667755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.8668106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.8668401Z ) 2025-05-07T20:32:38.8668766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.8669320Z def test_silu_mul_quant( 2025-05-07T20:32:38.8669609Z self, 2025-05-07T20:32:38.8669827Z T: int, 2025-05-07T20:32:38.8670040Z D: int, 2025-05-07T20:32:38.8670264Z scale_ub: Optional[float], 2025-05-07T20:32:38.8670547Z contiguous: bool, 2025-05-07T20:32:38.8670798Z compiled: bool, 2025-05-07T20:32:38.8671038Z ) -> None: 2025-05-07T20:32:38.8671269Z torch.manual_seed(2025) 2025-05-07T20:32:38.8671523Z 2025-05-07T20:32:38.8671801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.8672157Z 2025-05-07T20:32:38.8672363Z x_sign = torch.sign(x) 2025-05-07T20:32:38.8672662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.8672991Z x = x_sign * x_clamp 2025-05-07T20:32:38.8673244Z x0 = x[:, :D] 2025-05-07T20:32:38.8673473Z x1 = x[:, D:] 2025-05-07T20:32:38.8673686Z 2025-05-07T20:32:38.8673882Z if contiguous: 2025-05-07T20:32:38.8674139Z x0 = x0.contiguous() 2025-05-07T20:32:38.8674409Z x1 = x1.contiguous() 2025-05-07T20:32:38.8674665Z 2025-05-07T20:32:38.8674875Z if scale_ub is not None: 2025-05-07T20:32:38.8675158Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.8675511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.8675839Z ) 2025-05-07T20:32:38.8676040Z else: 2025-05-07T20:32:38.8676262Z scale_ub_tensor = None 2025-05-07T20:32:38.8676528Z 2025-05-07T20:32:38.8676765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.8677105Z op = silu_mul_quant 2025-05-07T20:32:38.8677374Z if compiled: 2025-05-07T20:32:38.8677630Z op = torch.compile(op) 2025-05-07T20:32:38.8677941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8678231Z 2025-05-07T20:32:38.8678430Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.8678610Z 2025-05-07T20:32:38.8678804Z moe/activation_test.py:117: 2025-05-07T20:32:38.8679119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8679465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.8679751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8680460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.8681174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.8681720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.8682415Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.8683092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.8683635Z kernel = self.compile( 2025-05-07T20:32:38.8684188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.8684857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8685267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8685504Z 2025-05-07T20:32:38.8685807Z self = 2025-05-07T20:32:38.8686897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.8688300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98475e0>} 2025-05-07T20:32:38.8689670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.8690739Z context = 2025-05-07T20:32:38.8691034Z 2025-05-07T20:32:38.8691213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.8691750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8692225Z module_map=module_map) 2025-05-07T20:32:38.8692606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8692964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8693240Z E ^ 2025-05-07T20:32:38.8693714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.8694165Z 2025-05-07T20:32:38.8694605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.8695124Z 2025-05-07T20:32:38.8695234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.8695660Z self=, 2025-05-07T20:32:38.8696076Z T=4096, 2025-05-07T20:32:38.8696275Z D=7168, 2025-05-07T20:32:38.8696485Z scale_ub=1200.0, 2025-05-07T20:32:38.8696728Z contiguous=False, 2025-05-07T20:32:38.8696960Z compiled=False, 2025-05-07T20:32:38.8697179Z ) 2025-05-07T20:32:38.8697505Z self = 2025-05-07T20:32:38.8705334Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.8705631Z 2025-05-07T20:32:38.8705717Z @given( 2025-05-07T20:32:38.8705970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.8706308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.8706623Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.8707064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.8707422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.8707717Z ) 2025-05-07T20:32:38.8708089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.8708564Z def test_silu_mul_quant( 2025-05-07T20:32:38.8708817Z self, 2025-05-07T20:32:38.8709031Z T: int, 2025-05-07T20:32:38.8709243Z D: int, 2025-05-07T20:32:38.8709477Z scale_ub: Optional[float], 2025-05-07T20:32:38.8709754Z contiguous: bool, 2025-05-07T20:32:38.8710013Z compiled: bool, 2025-05-07T20:32:38.8710254Z ) -> None: 2025-05-07T20:32:38.8710486Z torch.manual_seed(2025) 2025-05-07T20:32:38.8710743Z 2025-05-07T20:32:38.8711034Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.8711383Z 2025-05-07T20:32:38.8711591Z x_sign = torch.sign(x) 2025-05-07T20:32:38.8711907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.8712234Z x = x_sign * x_clamp 2025-05-07T20:32:38.8712495Z x0 = x[:, :D] 2025-05-07T20:32:38.8712729Z x1 = x[:, D:] 2025-05-07T20:32:38.8712947Z 2025-05-07T20:32:38.8713152Z if contiguous: 2025-05-07T20:32:38.8713485Z x0 = x0.contiguous() 2025-05-07T20:32:38.8713752Z x1 = x1.contiguous() 2025-05-07T20:32:38.8714015Z 2025-05-07T20:32:38.8714224Z if scale_ub is not None: 2025-05-07T20:32:38.8714510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.8714863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.8715193Z ) 2025-05-07T20:32:38.8715405Z else: 2025-05-07T20:32:38.8715625Z scale_ub_tensor = None 2025-05-07T20:32:38.8715895Z 2025-05-07T20:32:38.8716145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.8716470Z op = silu_mul_quant 2025-05-07T20:32:38.8716793Z if compiled: 2025-05-07T20:32:38.8717060Z op = torch.compile(op) 2025-05-07T20:32:38.8717367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8717659Z 2025-05-07T20:32:38.8717868Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.8718041Z 2025-05-07T20:32:38.8718149Z moe/activation_test.py:117: 2025-05-07T20:32:38.8718466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8718818Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.8719117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8719814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.8720517Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.8721069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.8721757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.8722464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.8723009Z kernel = self.compile( 2025-05-07T20:32:38.8723568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.8724243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8724654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8724900Z 2025-05-07T20:32:38.8725116Z self = 2025-05-07T20:32:38.8726286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.8727664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd97161f0>} 2025-05-07T20:32:38.8729026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.8730119Z context = 2025-05-07T20:32:38.8730412Z 2025-05-07T20:32:38.8730590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.8731125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8731592Z module_map=module_map) 2025-05-07T20:32:38.8731971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8732344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8732609Z E ^ 2025-05-07T20:32:38.8733078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.8733995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.8734669Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) failed with the same test body, traceback, and CompilationError as the example above.
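Every sampled example fails at the same point: Triton rejects the fp8e4nv dtype while lowering _fbgemm_silu_mul_quant. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton only emits for CUDA devices of compute capability 8.9 (Ada) or newer; older GPUs expose only the ('fp8e4b15', 'fp8e5') pair named in the error, so this looks like a hardware-capability mismatch rather than a kernel bug. Below is a minimal sketch of a capability guard that could skip these cases on such devices; the helper name and the skip decorator are illustrative, not the FBGEMM test suite's actual mechanism.

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only emitted for
        # CUDA devices with compute capability >= 8.9; older GPUs expose the
        # ('fp8e4b15', 'fp8e5') support set named in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # import unittest
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...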
2025-05-07T20:32:39.1716332Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:39.3711025Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same CompilationError.
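The repetition itself comes from Hypothesis, not from the kernel: @settings(verbosity=Verbosity.verbose) makes Hypothesis echo every generated example before running it, so a single environment-level failure is reported once per sampled (T, D, scale_ub, contiguous, compiled) tuple. A standalone sketch of that logging behavior, separate from the FBGEMM test:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_logs_every_example(T: int, compiled: bool) -> None:
        # Under Verbosity.verbose, Hypothesis prints a "Trying example: ..."
        # line with the sampled arguments before running this body, which is
        # exactly where the repeated blocks above come from.
        assert T >= 1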
2025-05-07T20:32:39.3743492Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:39.6536857Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.6570431Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.6601929Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) failed with the same CompilationError.
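Note that the compiled=True examples fail exactly like the compiled=False ones: torch.compile only adds the torch/_dynamo/eval_frame.py frame before the same _fbgemm_silu_mul_quant launch, so the dtype check still happens inside Triton's compiler on both paths. A minimal sketch of that dispatch pattern, with a stand-in op in place of the real fused kernel:

    from typing import Callable

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for the fused kernel under test: SiLU(x0) * x1, without
        # the fp8 quantization step.
        return torch.nn.functional.silu(x0) * x1

    def run(op: Callable[..., torch.Tensor], compiled: bool,
            *args: torch.Tensor) -> torch.Tensor:
        # Mirrors fn() in the test: the same op runs either eagerly or through
        # torch.compile, so a backend-unsupported dtype fails either way.
        if compiled:
            op = torch.compile(op)
        return op(*args)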
2025-05-07T20:32:39.8557047Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.8588638Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:40.1645683Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:40.3748667Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same traceback, ending in:
2025-05-07T20:32:40.3778583Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.3778937Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.3779207Z E   ^
2025-05-07T20:32:40.3779730Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3780190Z 2025-05-07T20:32:40.3780607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3781240Z 2025-05-07T20:32:40.3781351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3781775Z self=, 2025-05-07T20:32:40.3782188Z T=16384, 2025-05-07T20:32:40.3782384Z D=5120, 2025-05-07T20:32:40.3782586Z scale_ub=1200.0, 2025-05-07T20:32:40.3782818Z contiguous=True, 2025-05-07T20:32:40.3783043Z compiled=True, 2025-05-07T20:32:40.3783257Z ) 2025-05-07T20:32:40.3783582Z self = 2025-05-07T20:32:40.3784082Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.3784365Z 2025-05-07T20:32:40.3784446Z @given( 2025-05-07T20:32:40.3784690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3785014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3785324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3785663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3786003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3786382Z ) 2025-05-07T20:32:40.3786740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3787194Z def test_silu_mul_quant( 2025-05-07T20:32:40.3787436Z self, 2025-05-07T20:32:40.3787636Z T: int, 2025-05-07T20:32:40.3787842Z D: int, 2025-05-07T20:32:40.3788061Z scale_ub: Optional[float], 2025-05-07T20:32:40.3788340Z contiguous: bool, 2025-05-07T20:32:40.3788586Z compiled: bool, 2025-05-07T20:32:40.3788810Z ) -> None: 2025-05-07T20:32:40.3789033Z torch.manual_seed(2025) 2025-05-07T20:32:40.3789285Z 2025-05-07T20:32:40.3789570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3789967Z 2025-05-07T20:32:40.3790169Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3790471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3790786Z x = x_sign * x_clamp 2025-05-07T20:32:40.3791032Z x0 = x[:, :D] 2025-05-07T20:32:40.3791258Z x1 = x[:, D:] 2025-05-07T20:32:40.3791467Z 2025-05-07T20:32:40.3791660Z if contiguous: 2025-05-07T20:32:40.3791904Z x0 = x0.contiguous() 2025-05-07T20:32:40.3792168Z x1 = x1.contiguous() 2025-05-07T20:32:40.3792417Z 2025-05-07T20:32:40.3792616Z if scale_ub is not None: 2025-05-07T20:32:40.3792892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3793242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3793562Z ) 2025-05-07T20:32:40.3793759Z else: 2025-05-07T20:32:40.3793980Z scale_ub_tensor = None 2025-05-07T20:32:40.3794249Z 2025-05-07T20:32:40.3794486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3794814Z op = silu_mul_quant 2025-05-07T20:32:40.3795082Z if compiled: 2025-05-07T20:32:40.3795344Z op = torch.compile(op) 2025-05-07T20:32:40.3795647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3795939Z 2025-05-07T20:32:40.3796142Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3796311Z 2025-05-07T20:32:40.3796414Z moe/activation_test.py:117: 2025-05-07T20:32:40.3796719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3797062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3797350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3797916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3798481Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3799199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3799932Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3800495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3801188Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3801853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3802399Z kernel = self.compile( 2025-05-07T20:32:40.3802951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3803615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3804012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3804256Z 2025-05-07T20:32:40.3804475Z self = 2025-05-07T20:32:40.3805605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3807035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd932a550>} 2025-05-07T20:32:40.3808380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3809414Z context = 2025-05-07T20:32:40.3809718Z 2025-05-07T20:32:40.3809893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3810473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3810946Z module_map=module_map) 2025-05-07T20:32:40.3811324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3811695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3811966Z E ^ 2025-05-07T20:32:40.3812427Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3812884Z 2025-05-07T20:32:40.3813306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3813817Z 2025-05-07T20:32:40.6041655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6042326Z self=, 2025-05-07T20:32:40.6042909Z T=16384, 2025-05-07T20:32:40.6043210Z D=5120, 2025-05-07T20:32:40.6043431Z scale_ub=None, 2025-05-07T20:32:40.6043663Z contiguous=False, 2025-05-07T20:32:40.6043901Z compiled=True, 2025-05-07T20:32:40.6044125Z ) 2025-05-07T20:32:40.6044458Z self = 2025-05-07T20:32:40.6044970Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.6045255Z 2025-05-07T20:32:40.6045339Z @given( 2025-05-07T20:32:40.6045578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6045899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6046208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6046548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6046887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6047176Z ) 2025-05-07T20:32:40.6047536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6048274Z def test_silu_mul_quant( 2025-05-07T20:32:40.6048524Z self, 2025-05-07T20:32:40.6048730Z T: int, 2025-05-07T20:32:40.6048934Z D: int, 2025-05-07T20:32:40.6049155Z scale_ub: Optional[float], 2025-05-07T20:32:40.6049438Z contiguous: bool, 2025-05-07T20:32:40.6049690Z compiled: bool, 2025-05-07T20:32:40.6049925Z ) -> None: 2025-05-07T20:32:40.6050144Z torch.manual_seed(2025) 2025-05-07T20:32:40.6050395Z 2025-05-07T20:32:40.6050672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6051019Z 2025-05-07T20:32:40.6051215Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6051513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6051827Z x = x_sign * x_clamp 2025-05-07T20:32:40.6052075Z x0 = x[:, :D] 2025-05-07T20:32:40.6052302Z x1 = x[:, D:] 2025-05-07T20:32:40.6052511Z 2025-05-07T20:32:40.6052701Z if contiguous: 2025-05-07T20:32:40.6052949Z x0 = x0.contiguous() 2025-05-07T20:32:40.6053209Z x1 = x1.contiguous() 2025-05-07T20:32:40.6053459Z 2025-05-07T20:32:40.6053656Z if scale_ub is not None: 2025-05-07T20:32:40.6053931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6054348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6054736Z ) 2025-05-07T20:32:40.6054930Z else: 2025-05-07T20:32:40.6055154Z scale_ub_tensor = None 2025-05-07T20:32:40.6055415Z 2025-05-07T20:32:40.6055654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6055972Z op = silu_mul_quant 2025-05-07T20:32:40.6056234Z if compiled: 2025-05-07T20:32:40.6056494Z op = torch.compile(op) 2025-05-07T20:32:40.6056791Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6057081Z 2025-05-07T20:32:40.6064985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6065313Z 2025-05-07T20:32:40.6065432Z moe/activation_test.py:117: 2025-05-07T20:32:40.6065746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6066090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6066395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6066983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6067545Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6068213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6068906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6069460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6070145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6070827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6071380Z kernel = self.compile( 2025-05-07T20:32:40.6071930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6072609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6073018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6073259Z 2025-05-07T20:32:40.6073480Z self = 2025-05-07T20:32:40.6074576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6076005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd92d21f0>} 2025-05-07T20:32:40.6077379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6078405Z context = 2025-05-07T20:32:40.6078697Z 2025-05-07T20:32:40.6078876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6079400Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6079884Z module_map=module_map) 2025-05-07T20:32:40.6080261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6080620Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6080896Z E ^ 2025-05-07T20:32:40.6081369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6081881Z 2025-05-07T20:32:40.6082415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6082941Z 2025-05-07T20:32:40.6083172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6083603Z self=, 2025-05-07T20:32:40.6084022Z T=2048, 2025-05-07T20:32:40.6084213Z D=5120, 2025-05-07T20:32:40.6084416Z scale_ub=None, 2025-05-07T20:32:40.6084647Z contiguous=False, 2025-05-07T20:32:40.6084887Z compiled=True, 2025-05-07T20:32:40.6085094Z ) 2025-05-07T20:32:40.7285013Z self = 2025-05-07T20:32:40.7286523Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.7287309Z 2025-05-07T20:32:40.7287533Z @given( 2025-05-07T20:32:40.7288580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7289215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7289827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7290177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7290520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7290819Z ) 2025-05-07T20:32:40.7291182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7291645Z def test_silu_mul_quant( 2025-05-07T20:32:40.7291897Z self, 2025-05-07T20:32:40.7292111Z T: int, 2025-05-07T20:32:40.7292324Z D: int, 2025-05-07T20:32:40.7292551Z scale_ub: Optional[float], 2025-05-07T20:32:40.7292840Z contiguous: bool, 2025-05-07T20:32:40.7293095Z compiled: bool, 2025-05-07T20:32:40.7293331Z ) -> None: 2025-05-07T20:32:40.7293564Z torch.manual_seed(2025) 2025-05-07T20:32:40.7293826Z 2025-05-07T20:32:40.7294108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7294466Z 2025-05-07T20:32:40.7294675Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7294974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7295304Z x = x_sign * x_clamp 2025-05-07T20:32:40.7295567Z x0 = x[:, :D] 2025-05-07T20:32:40.7295789Z x1 = x[:, D:] 2025-05-07T20:32:40.7296011Z 2025-05-07T20:32:40.7296212Z if contiguous: 2025-05-07T20:32:40.7296455Z x0 = x0.contiguous() 2025-05-07T20:32:40.7296731Z x1 = x1.contiguous() 2025-05-07T20:32:40.7296986Z 2025-05-07T20:32:40.7297185Z if scale_ub is not None: 2025-05-07T20:32:40.7297475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7297829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7298152Z ) 2025-05-07T20:32:40.7298355Z else: 2025-05-07T20:32:40.7298680Z scale_ub_tensor = None 2025-05-07T20:32:40.7298949Z 2025-05-07T20:32:40.7299187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7299516Z op = silu_mul_quant 2025-05-07T20:32:40.7299784Z if compiled: 2025-05-07T20:32:40.7300074Z op = torch.compile(op) 2025-05-07T20:32:40.7300410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7300700Z 2025-05-07T20:32:40.7300900Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.7301191Z 2025-05-07T20:32:40.7301297Z moe/activation_test.py:117: 2025-05-07T20:32:40.7301607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7301959Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.7302248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7302824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.7303406Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.7304075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.7304767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.7305397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7306162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7306829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7307378Z kernel = self.compile( 2025-05-07T20:32:40.7307942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7308603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7309023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7309313Z 2025-05-07T20:32:40.7309530Z self = 2025-05-07T20:32:40.7310623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7312016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd92d2f70>} 2025-05-07T20:32:40.7313368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7314396Z context = 2025-05-07T20:32:40.7314690Z 2025-05-07T20:32:40.7314880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7315425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7315894Z module_map=module_map) 2025-05-07T20:32:40.7316280Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7316660Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.7316927Z E ^ 2025-05-07T20:32:40.7317413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7317862Z 2025-05-07T20:32:40.7318293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7318809Z 2025-05-07T20:32:40.7318918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.7319341Z self=, 2025-05-07T20:32:40.7319847Z T=2048, 2025-05-07T20:32:40.7320045Z D=5120, 2025-05-07T20:32:40.7320249Z scale_ub=1200.0, 2025-05-07T20:32:40.7320487Z contiguous=False, 2025-05-07T20:32:40.7320722Z compiled=True, 2025-05-07T20:32:40.7320943Z ) 2025-05-07T20:32:40.7321276Z self = 2025-05-07T20:32:40.7321793Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.7322072Z 2025-05-07T20:32:40.7322157Z @given( 2025-05-07T20:32:40.7322400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7322727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7323040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7323383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7323732Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7324022Z ) 2025-05-07T20:32:40.7324388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7324843Z def test_silu_mul_quant( 2025-05-07T20:32:40.7325088Z self, 2025-05-07T20:32:40.7325294Z T: int, 2025-05-07T20:32:40.7325502Z D: int, 2025-05-07T20:32:40.7325732Z scale_ub: Optional[float], 2025-05-07T20:32:40.7326056Z contiguous: bool, 2025-05-07T20:32:40.7326345Z compiled: bool, 2025-05-07T20:32:40.7326580Z ) -> None: 2025-05-07T20:32:40.7326805Z torch.manual_seed(2025) 2025-05-07T20:32:40.7327061Z 2025-05-07T20:32:40.7327343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7327690Z 2025-05-07T20:32:40.7327896Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7328200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7328516Z x = x_sign * x_clamp 2025-05-07T20:32:40.7328770Z x0 = x[:, :D] 2025-05-07T20:32:40.7329000Z x1 = x[:, D:] 2025-05-07T20:32:40.7329256Z 2025-05-07T20:32:40.7329458Z if contiguous: 2025-05-07T20:32:40.7329705Z x0 = x0.contiguous() 2025-05-07T20:32:40.7329973Z x1 = x1.contiguous() 2025-05-07T20:32:40.7330225Z 2025-05-07T20:32:40.7330429Z if scale_ub is not None: 2025-05-07T20:32:40.7330708Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7331074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7331397Z ) 2025-05-07T20:32:40.7331602Z else: 2025-05-07T20:32:40.7331818Z scale_ub_tensor = None 2025-05-07T20:32:40.7332088Z 2025-05-07T20:32:40.7332334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7332659Z op = silu_mul_quant 2025-05-07T20:32:40.7332928Z if compiled: 2025-05-07T20:32:40.7333190Z op = torch.compile(op) 2025-05-07T20:32:40.7333496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7333785Z 2025-05-07T20:32:40.7334000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.7334170Z 2025-05-07T20:32:40.7334273Z moe/activation_test.py:117: 2025-05-07T20:32:40.7334579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7334931Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.7335234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7335797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.7336359Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.7337026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.7337713Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.7338270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7339010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7339690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7340498Z kernel = self.compile( 2025-05-07T20:32:40.7341114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7341789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7342192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7342432Z 2025-05-07T20:32:40.7342647Z self = 2025-05-07T20:32:40.7343732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7345105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd913c940>} 2025-05-07T20:32:40.7346564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7347641Z context = 2025-05-07T20:32:40.7347940Z 2025-05-07T20:32:40.7348110Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7348655Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7349122Z module_map=module_map) 2025-05-07T20:32:40.7349501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7349870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.7350137Z E ^ 2025-05-07T20:32:40.7350677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7351131Z 2025-05-07T20:32:40.7351556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7352073Z 2025-05-07T20:32:41.1511448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.1512144Z self=, 2025-05-07T20:32:41.1512685Z T=4096, 2025-05-07T20:32:41.1512879Z D=5120, 2025-05-07T20:32:41.1513085Z scale_ub=1200.0, 2025-05-07T20:32:41.1513322Z contiguous=True, 2025-05-07T20:32:41.1513552Z compiled=True, 2025-05-07T20:32:41.1513777Z ) 2025-05-07T20:32:41.1514109Z self = 2025-05-07T20:32:41.1514610Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.1514904Z 2025-05-07T20:32:41.1514998Z @given( 2025-05-07T20:32:41.1515236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.1515564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.1515883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.1516230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.1516578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.1516871Z ) 2025-05-07T20:32:41.1517231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.1517683Z def test_silu_mul_quant( 2025-05-07T20:32:41.1517926Z self, 2025-05-07T20:32:41.1518130Z T: int, 2025-05-07T20:32:41.1518336Z D: int, 2025-05-07T20:32:41.1518559Z scale_ub: Optional[float], 2025-05-07T20:32:41.1518842Z contiguous: bool, 2025-05-07T20:32:41.1519092Z compiled: bool, 2025-05-07T20:32:41.1519322Z ) -> None: 2025-05-07T20:32:41.1519849Z torch.manual_seed(2025) 2025-05-07T20:32:41.1520107Z 2025-05-07T20:32:41.1520391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.1520738Z 2025-05-07T20:32:41.1520944Z x_sign = torch.sign(x) 2025-05-07T20:32:41.1521242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.1521561Z x = x_sign * x_clamp 2025-05-07T20:32:41.1521812Z x0 = x[:, :D] 2025-05-07T20:32:41.1522039Z x1 = x[:, D:] 2025-05-07T20:32:41.1522249Z 2025-05-07T20:32:41.1522446Z if contiguous: 2025-05-07T20:32:41.1522689Z x0 = x0.contiguous() 2025-05-07T20:32:41.1522956Z x1 = x1.contiguous() 2025-05-07T20:32:41.1523209Z 2025-05-07T20:32:41.1523412Z if scale_ub is not None: 2025-05-07T20:32:41.1523690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.1524040Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.1524362Z ) 2025-05-07T20:32:41.1524564Z else: 2025-05-07T20:32:41.1524785Z scale_ub_tensor = None 2025-05-07T20:32:41.1525046Z 2025-05-07T20:32:41.1525289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.1525609Z op = silu_mul_quant 2025-05-07T20:32:41.1525871Z if compiled: 2025-05-07T20:32:41.1526293Z op = torch.compile(op) 2025-05-07T20:32:41.1526596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.1526884Z 2025-05-07T20:32:41.1527088Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.1527254Z 2025-05-07T20:32:41.1527358Z moe/activation_test.py:117: 2025-05-07T20:32:41.1527667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.1528011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.1528295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.1528867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.1529509Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.1530176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.1530860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.1531414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.1532103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.1532776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.1533307Z kernel = self.compile( 2025-05-07T20:32:41.1533942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.1534644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.1535051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.1535294Z 2025-05-07T20:32:41.1535505Z self = 2025-05-07T20:32:41.1536599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.1537998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9107790>} 2025-05-07T20:32:41.1539352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.1540734Z context = 2025-05-07T20:32:41.1541201Z 2025-05-07T20:32:41.1541377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.1541909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.1542380Z module_map=module_map) 2025-05-07T20:32:41.1542757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.1543116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.1543385Z E ^ 2025-05-07T20:32:41.1543847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.1544300Z 2025-05-07T20:32:41.1544718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.1545236Z 2025-05-07T20:32:41.1545342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.1545766Z self=, 2025-05-07T20:32:41.1546170Z T=128, 2025-05-07T20:32:41.1546369Z D=5120, 2025-05-07T20:32:41.1546569Z scale_ub=1200.0, 2025-05-07T20:32:41.1546796Z contiguous=False, 2025-05-07T20:32:41.1547030Z compiled=True, 2025-05-07T20:32:41.1547244Z ) 2025-05-07T20:32:41.2875893Z self = 2025-05-07T20:32:41.2876675Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.2877071Z 2025-05-07T20:32:41.2877190Z @given( 2025-05-07T20:32:41.2877442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2877764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2878088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2878432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2878769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2879070Z ) 2025-05-07T20:32:41.2879550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2880005Z def test_silu_mul_quant( 2025-05-07T20:32:41.2880253Z self, 2025-05-07T20:32:41.2880462Z T: int, 2025-05-07T20:32:41.2880672Z D: int, 2025-05-07T20:32:41.2880909Z scale_ub: Optional[float], 2025-05-07T20:32:41.2881203Z contiguous: bool, 2025-05-07T20:32:41.2881456Z compiled: bool, 2025-05-07T20:32:41.2881690Z ) -> None: 2025-05-07T20:32:41.2881922Z torch.manual_seed(2025) 2025-05-07T20:32:41.2882179Z 2025-05-07T20:32:41.2882457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2882815Z 2025-05-07T20:32:41.2883022Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2883321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2883650Z x = x_sign * x_clamp 2025-05-07T20:32:41.2883912Z x0 = x[:, :D] 2025-05-07T20:32:41.2884144Z x1 = x[:, D:] 2025-05-07T20:32:41.2884364Z 2025-05-07T20:32:41.2884564Z if contiguous: 2025-05-07T20:32:41.2884806Z x0 = x0.contiguous() 2025-05-07T20:32:41.2885086Z x1 = x1.contiguous() 2025-05-07T20:32:41.2885344Z 2025-05-07T20:32:41.2885553Z if scale_ub is not None: 2025-05-07T20:32:41.2885844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2886197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2886520Z ) 2025-05-07T20:32:41.2886726Z else: 2025-05-07T20:32:41.2886953Z scale_ub_tensor = None 2025-05-07T20:32:41.2887221Z 2025-05-07T20:32:41.2887464Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2887799Z op = silu_mul_quant 2025-05-07T20:32:41.2888069Z if compiled: 2025-05-07T20:32:41.2888332Z op = torch.compile(op) 2025-05-07T20:32:41.2888650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2889031Z 2025-05-07T20:32:41.2889237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2889414Z 2025-05-07T20:32:41.2889520Z moe/activation_test.py:117: 2025-05-07T20:32:41.2889827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2890177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2890475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2891049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2891618Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2892287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2892985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2893536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2894233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2894907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2895452Z kernel = self.compile( 2025-05-07T20:32:41.2896050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2896782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2897193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2897438Z 2025-05-07T20:32:41.2897654Z self = 2025-05-07T20:32:41.2898766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2900202Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8f0e0d0>} 2025-05-07T20:32:41.2901657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2902687Z context = 2025-05-07T20:32:41.2902980Z 2025-05-07T20:32:41.2903162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2903699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2904167Z module_map=module_map) 2025-05-07T20:32:41.2904547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2904912Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2905186Z E ^ 2025-05-07T20:32:41.2905666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2906123Z 2025-05-07T20:32:41.2906545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2907062Z 2025-05-07T20:32:41.2907177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2907594Z self=, 2025-05-07T20:32:41.2908010Z T=16384, 2025-05-07T20:32:41.2915455Z D=7168, 2025-05-07T20:32:41.2915702Z scale_ub=1200.0, 2025-05-07T20:32:41.2915950Z contiguous=True, 2025-05-07T20:32:41.2916188Z compiled=True, 2025-05-07T20:32:41.2916407Z ) 2025-05-07T20:32:41.2916747Z self = 2025-05-07T20:32:41.2917345Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.2917638Z 2025-05-07T20:32:41.2917735Z @given( 2025-05-07T20:32:41.2917976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2918311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2918633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2918974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2919321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2919626Z ) 2025-05-07T20:32:41.2919989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2920448Z def test_silu_mul_quant( 2025-05-07T20:32:41.2920706Z self, 2025-05-07T20:32:41.2920908Z T: int, 2025-05-07T20:32:41.2921124Z D: int, 2025-05-07T20:32:41.2921358Z scale_ub: Optional[float], 2025-05-07T20:32:41.2921637Z contiguous: bool, 2025-05-07T20:32:41.2921894Z compiled: bool, 2025-05-07T20:32:41.2922139Z ) -> None: 2025-05-07T20:32:41.2922374Z torch.manual_seed(2025) 2025-05-07T20:32:41.2922627Z 2025-05-07T20:32:41.2922921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2923284Z 2025-05-07T20:32:41.2923484Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2923879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2924215Z x = x_sign * x_clamp 2025-05-07T20:32:41.2924463Z x0 = x[:, :D] 2025-05-07T20:32:41.2924695Z x1 = x[:, D:] 2025-05-07T20:32:41.2924917Z 2025-05-07T20:32:41.2925109Z if contiguous: 2025-05-07T20:32:41.2925360Z x0 = x0.contiguous() 2025-05-07T20:32:41.2925644Z x1 = x1.contiguous() 2025-05-07T20:32:41.2925892Z 2025-05-07T20:32:41.2926095Z if scale_ub is not None: 2025-05-07T20:32:41.2926386Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2926735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2927111Z ) 2025-05-07T20:32:41.2927315Z else: 2025-05-07T20:32:41.2927537Z scale_ub_tensor = None 2025-05-07T20:32:41.2927793Z 2025-05-07T20:32:41.2928038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2928366Z op = silu_mul_quant 2025-05-07T20:32:41.2928629Z if compiled: 2025-05-07T20:32:41.2928891Z op = torch.compile(op) 2025-05-07T20:32:41.2929203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2929488Z 2025-05-07T20:32:41.2929695Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2929868Z 2025-05-07T20:32:41.2929984Z moe/activation_test.py:117: 2025-05-07T20:32:41.2930292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2930641Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2930940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2931514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2932090Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2932766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2933479Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2934021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2934730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2935410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2935956Z kernel = self.compile( 2025-05-07T20:32:41.2936510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2937238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2937672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2937909Z 2025-05-07T20:32:41.2938126Z self = 2025-05-07T20:32:41.2939225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2941204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8f0ed30>} 2025-05-07T20:32:41.2942575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2943615Z context = 2025-05-07T20:32:41.2943912Z 2025-05-07T20:32:41.2944090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2944641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2945277Z module_map=module_map) 2025-05-07T20:32:41.2945667Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2946032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2946315Z E ^ 2025-05-07T20:32:41.2946795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2947249Z 2025-05-07T20:32:41.2947669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2948191Z 2025-05-07T20:32:41.5712521Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5713325Z self=, 2025-05-07T20:32:41.5713757Z T=16384, 2025-05-07T20:32:41.5713960Z D=5120, 2025-05-07T20:32:41.5714168Z scale_ub=1200.0, 2025-05-07T20:32:41.5714406Z contiguous=True, 2025-05-07T20:32:41.5714633Z compiled=False, 2025-05-07T20:32:41.5714866Z ) 2025-05-07T20:32:41.5715199Z self = 2025-05-07T20:32:41.5715705Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.5715998Z 2025-05-07T20:32:41.5716083Z @given( 2025-05-07T20:32:41.5716331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5716650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5716972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5717319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5717669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5717968Z ) 2025-05-07T20:32:41.5718332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5718787Z def test_silu_mul_quant( 2025-05-07T20:32:41.5719035Z self, 2025-05-07T20:32:41.5719246Z T: int, 2025-05-07T20:32:41.5719466Z D: int, 2025-05-07T20:32:41.5719691Z scale_ub: Optional[float], 2025-05-07T20:32:41.5719989Z contiguous: bool, 2025-05-07T20:32:41.5720272Z compiled: bool, 2025-05-07T20:32:41.5720529Z ) -> None: 2025-05-07T20:32:41.5720765Z torch.manual_seed(2025) 2025-05-07T20:32:41.5721027Z 2025-05-07T20:32:41.5721303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5721661Z 2025-05-07T20:32:41.5721868Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5722173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5722499Z x = x_sign * x_clamp 2025-05-07T20:32:41.5722855Z x0 = x[:, :D] 2025-05-07T20:32:41.5723079Z x1 = x[:, D:] 2025-05-07T20:32:41.5723297Z 2025-05-07T20:32:41.5723493Z if contiguous: 2025-05-07T20:32:41.5723729Z x0 = x0.contiguous() 2025-05-07T20:32:41.5724004Z x1 = x1.contiguous() 2025-05-07T20:32:41.5724261Z 2025-05-07T20:32:41.5724462Z if scale_ub is not None: 2025-05-07T20:32:41.5724751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5725105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5725421Z ) 2025-05-07T20:32:41.5725629Z else: 2025-05-07T20:32:41.5725852Z scale_ub_tensor = None 2025-05-07T20:32:41.5726115Z 2025-05-07T20:32:41.5726362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5726686Z op = silu_mul_quant 2025-05-07T20:32:41.5726945Z if compiled: 2025-05-07T20:32:41.5727205Z op = torch.compile(op) 2025-05-07T20:32:41.5727519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5727805Z 2025-05-07T20:32:41.5728000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5728174Z 2025-05-07T20:32:41.5728279Z moe/activation_test.py:117: 2025-05-07T20:32:41.5728583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5729067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5729360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5730063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5730758Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5731305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5731995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5732668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5733252Z kernel = self.compile( 2025-05-07T20:32:41.5733800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5734467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5734887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5735122Z 2025-05-07T20:32:41.5735336Z self = 2025-05-07T20:32:41.5736430Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5737845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9052700>} 2025-05-07T20:32:41.5739190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5740476Z context = 2025-05-07T20:32:41.5740781Z 2025-05-07T20:32:41.5740952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5741548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5742025Z module_map=module_map) 2025-05-07T20:32:41.5742397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5742767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5743037Z E ^ 2025-05-07T20:32:41.5743581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5744054Z 2025-05-07T20:32:41.5744473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5744994Z 2025-05-07T20:32:41.5745101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5745557Z self=, 2025-05-07T20:32:41.5745975Z T=1, 2025-05-07T20:32:41.5746165Z D=7168, 2025-05-07T20:32:41.5746370Z scale_ub=1200.0, 2025-05-07T20:32:41.5746604Z contiguous=False, 2025-05-07T20:32:41.5746834Z compiled=False, 2025-05-07T20:32:41.5747050Z ) 2025-05-07T20:32:41.5747378Z self = 2025-05-07T20:32:41.5747871Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.5748154Z 2025-05-07T20:32:41.5748236Z @given( 2025-05-07T20:32:41.5748487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5748813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5749126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5749470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5749808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5750227Z ) 2025-05-07T20:32:41.5750594Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5751042Z def test_silu_mul_quant( 2025-05-07T20:32:41.5751286Z self, 2025-05-07T20:32:41.5751490Z T: int, 2025-05-07T20:32:41.5751694Z D: int, 2025-05-07T20:32:41.5751916Z scale_ub: Optional[float], 2025-05-07T20:32:41.5752200Z contiguous: bool, 2025-05-07T20:32:41.5752447Z compiled: bool, 2025-05-07T20:32:41.5752684Z ) -> None: 2025-05-07T20:32:41.5752904Z torch.manual_seed(2025) 2025-05-07T20:32:41.5753157Z 2025-05-07T20:32:41.5753528Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5753883Z 2025-05-07T20:32:41.5754090Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5754385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5754710Z x = x_sign * x_clamp 2025-05-07T20:32:41.5754974Z x0 = x[:, :D] 2025-05-07T20:32:41.5755195Z x1 = x[:, D:] 2025-05-07T20:32:41.5755421Z 2025-05-07T20:32:41.5755623Z if contiguous: 2025-05-07T20:32:41.5755857Z x0 = x0.contiguous() 2025-05-07T20:32:41.5756126Z x1 = x1.contiguous() 2025-05-07T20:32:41.5756377Z 2025-05-07T20:32:41.5756578Z if scale_ub is not None: 2025-05-07T20:32:41.5756855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5757201Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5757527Z ) 2025-05-07T20:32:41.5757726Z else: 2025-05-07T20:32:41.5757953Z scale_ub_tensor = None 2025-05-07T20:32:41.5758220Z 2025-05-07T20:32:41.5758453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5758782Z op = silu_mul_quant 2025-05-07T20:32:41.5759044Z if compiled: 2025-05-07T20:32:41.5759296Z op = torch.compile(op) 2025-05-07T20:32:41.5759609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5759898Z 2025-05-07T20:32:41.5760091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5760268Z 2025-05-07T20:32:41.5760371Z moe/activation_test.py:117: 2025-05-07T20:32:41.5760671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5761015Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5761303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5762006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5762758Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5763306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5763998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5764672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5765214Z kernel = self.compile( 2025-05-07T20:32:41.5765755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5766421Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5766827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5767061Z 2025-05-07T20:32:41.5767282Z self = 2025-05-07T20:32:41.5768371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5769799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd900c0d0>} 2025-05-07T20:32:41.5771197Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5772239Z context = 2025-05-07T20:32:41.5772532Z 2025-05-07T20:32:41.5772705Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5773250Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5773775Z module_map=module_map) 2025-05-07T20:32:41.5774161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5774523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5774795Z E ^ 2025-05-07T20:32:41.5775268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5775722Z 2025-05-07T20:32:41.5776141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5776662Z 2025-05-07T20:32:41.5776770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5777198Z self=, 2025-05-07T20:32:41.5777615Z T=4096, 2025-05-07T20:32:41.5777809Z D=7168, 2025-05-07T20:32:41.5778013Z scale_ub=1200.0, 2025-05-07T20:32:41.5778250Z contiguous=False, 2025-05-07T20:32:41.5778482Z compiled=True, 2025-05-07T20:32:41.5778703Z ) 2025-05-07T20:32:41.6962047Z self = 2025-05-07T20:32:41.6963153Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.6963704Z 2025-05-07T20:32:41.6963875Z @given( 2025-05-07T20:32:41.6964363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6965010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6965629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6966297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6966953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6967532Z ) 2025-05-07T20:32:41.6968230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6969108Z def test_silu_mul_quant( 2025-05-07T20:32:41.6969594Z self, 2025-05-07T20:32:41.6969991Z T: int, 2025-05-07T20:32:41.6970258Z D: int, 2025-05-07T20:32:41.6970788Z scale_ub: Optional[float], 2025-05-07T20:32:41.6971077Z contiguous: bool, 2025-05-07T20:32:41.6971318Z compiled: bool, 2025-05-07T20:32:41.6971553Z ) -> None: 2025-05-07T20:32:41.6971779Z torch.manual_seed(2025) 2025-05-07T20:32:41.6972025Z 2025-05-07T20:32:41.6972311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6972661Z 2025-05-07T20:32:41.6972860Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6973151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6973470Z x = x_sign * x_clamp 2025-05-07T20:32:41.6973718Z x0 = x[:, :D] 2025-05-07T20:32:41.6973935Z x1 = x[:, D:] 2025-05-07T20:32:41.6974148Z 2025-05-07T20:32:41.6974342Z if contiguous: 2025-05-07T20:32:41.6974575Z x0 = x0.contiguous() 2025-05-07T20:32:41.6974849Z x1 = x1.contiguous() 2025-05-07T20:32:41.6975098Z 2025-05-07T20:32:41.6975299Z if scale_ub is not None: 2025-05-07T20:32:41.6975583Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6975930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6976242Z ) 2025-05-07T20:32:41.6976449Z else: 2025-05-07T20:32:41.6976814Z scale_ub_tensor = None 2025-05-07T20:32:41.6977070Z 2025-05-07T20:32:41.6977311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6977637Z op = silu_mul_quant 2025-05-07T20:32:41.6977905Z if compiled: 2025-05-07T20:32:41.6978154Z op = torch.compile(op) 2025-05-07T20:32:41.6978459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6978744Z 2025-05-07T20:32:41.6978939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6979112Z 2025-05-07T20:32:41.6979215Z moe/activation_test.py:117: 2025-05-07T20:32:41.6979520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6979945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6980251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6980827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.6981491Z return fn(*args, **kwargs) 
Hypothesis continued with further examples. Every example that reached the Triton kernel launch failed with the identical CompilationError; as GPU memory filled up, larger examples began failing earlier, during input setup:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError (fp8e4nv not supported) from the _fbgemm_silu_mul_quant launch.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
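The allocation sizes in these OutOfMemoryErrors match the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16 (2 bytes per element), and each of torch.abs, torch.clamp, and torch.sign materializes another tensor of the same size. A quick check of the figures reported in this log:

def x_size_mib(T: int, D: int) -> float:
    # torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes per element
    return T * 2 * D * 2 / 2**20

print(x_size_mib(16384, 5120))  # 320.0 -> the failed x_clamp allocation above
print(x_size_mib(16384, 7168))  # 448.0 -> the largest example in this run
print(x_size_mib(2048, 5120))   # 40.0

Even the largest example needs only a few GiB with all temporaries live, far below the card's 22.07 GiB. The process is already holding ~21.9 GiB when these examples run, and the free figure keeps shrinking below (140.44 MiB here, then 28.44, then 26.44 MiB), so the OOMs come from memory accumulating across Hypothesis examples rather than from any single example being too large for the GPU.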
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95): tried to allocate 112.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 448.00 MiB with 140.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95): tried to allocate 56.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x_sign = torch.sign(x) (moe/activation_test.py:94): tried to allocate 56.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError (fp8e4nv not supported) from the _fbgemm_silu_mul_quant launch.

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError.
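Each OOM message carries the allocator's fragmentation hint (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), but given the shrinking free-memory trend, releasing memory between examples is the more direct mitigation. The sketch below is a hypothetical fix, not the test suite's actual code; note that the environment variable only takes effect if set before the first CUDA allocation, so in CI it belongs in the job environment rather than in the test body:

import gc

import torch

def release_cuda_memory() -> None:
    # Hypothetical helper to call at the top of test_silu_mul_quant, so that it
    # runs once per Hypothesis example rather than once per test method:
    gc.collect()              # drop tensors kept alive only by stored tracebacks
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver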
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 56.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x_sign = torch.sign(x) (moe/activation_test.py:94): tried to allocate 40.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 320.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 80.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 40.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 112.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5957418Z 2025-05-07T20:32:42.5957540Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5957763Z 2025-05-07T20:32:42.5957870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5958288Z self=, 2025-05-07T20:32:42.5958699Z T=4096, 2025-05-07T20:32:42.5958889Z D=7168, 2025-05-07T20:32:42.5959091Z scale_ub=1200.0, 2025-05-07T20:32:42.5959327Z contiguous=True, 2025-05-07T20:32:42.5959560Z compiled=False, 2025-05-07T20:32:42.5959780Z ) 2025-05-07T20:32:42.5960106Z self = 2025-05-07T20:32:42.5960629Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5960939Z 2025-05-07T20:32:42.5961027Z @given( 2025-05-07T20:32:42.5961267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5961594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5961913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5962254Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5962597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5962891Z ) 2025-05-07T20:32:42.5963255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5963711Z def test_silu_mul_quant( 2025-05-07T20:32:42.5963961Z self, 2025-05-07T20:32:42.5964222Z T: int, 2025-05-07T20:32:42.5964435Z D: int, 2025-05-07T20:32:42.5964659Z scale_ub: Optional[float], 2025-05-07T20:32:42.5964945Z contiguous: bool, 2025-05-07T20:32:42.5965193Z compiled: bool, 2025-05-07T20:32:42.5965424Z ) -> None: 2025-05-07T20:32:42.5965655Z torch.manual_seed(2025) 2025-05-07T20:32:42.5965903Z 2025-05-07T20:32:42.5966186Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5968209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5970048Z 2025-05-07T20:32:42.5970169Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5970386Z 2025-05-07T20:32:42.5970502Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5970959Z self=, 2025-05-07T20:32:42.5971409Z T=16384, 2025-05-07T20:32:42.5971612Z D=7168, 2025-05-07T20:32:42.5971807Z scale_ub=None, 2025-05-07T20:32:42.5972032Z contiguous=False, 2025-05-07T20:32:42.5972267Z compiled=True, 2025-05-07T20:32:42.5972478Z ) 2025-05-07T20:32:42.7266765Z self = 2025-05-07T20:32:42.7267310Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.7267604Z 2025-05-07T20:32:42.7267689Z @given( 2025-05-07T20:32:42.7267936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7268576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7268906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7269256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7269593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7269898Z ) 2025-05-07T20:32:42.7270277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7270724Z def test_silu_mul_quant( 2025-05-07T20:32:42.7270980Z self, 2025-05-07T20:32:42.7271187Z T: int, 2025-05-07T20:32:42.7271389Z D: int, 2025-05-07T20:32:42.7271624Z scale_ub: Optional[float], 2025-05-07T20:32:42.7271910Z contiguous: bool, 2025-05-07T20:32:42.7272164Z compiled: bool, 2025-05-07T20:32:42.7272395Z ) -> None: 2025-05-07T20:32:42.7272624Z torch.manual_seed(2025) 2025-05-07T20:32:42.7272879Z 2025-05-07T20:32:42.7273158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7275224Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7277081Z 2025-05-07T20:32:42.7277211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7277429Z 2025-05-07T20:32:42.7277546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7277973Z self=, 2025-05-07T20:32:42.7278384Z T=4096, 2025-05-07T20:32:42.7278583Z D=7168, 2025-05-07T20:32:42.7278871Z scale_ub=None, 2025-05-07T20:32:42.7279095Z contiguous=True, 2025-05-07T20:32:42.7279335Z compiled=False, 2025-05-07T20:32:42.7279555Z ) 2025-05-07T20:32:42.7279877Z self = 2025-05-07T20:32:42.7280388Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7280669Z 2025-05-07T20:32:42.7280760Z @given( 2025-05-07T20:32:42.7280996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7281321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7281640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7281986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7282320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7282639Z ) 2025-05-07T20:32:42.7283006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7283466Z def test_silu_mul_quant( 2025-05-07T20:32:42.7283715Z self, 2025-05-07T20:32:42.7283926Z T: int, 2025-05-07T20:32:42.7284140Z D: int, 2025-05-07T20:32:42.7284365Z scale_ub: Optional[float], 2025-05-07T20:32:42.7284652Z contiguous: bool, 2025-05-07T20:32:42.7284906Z compiled: bool, 2025-05-07T20:32:42.7285359Z ) -> None: 2025-05-07T20:32:42.7285593Z torch.manual_seed(2025) 2025-05-07T20:32:42.7285851Z 2025-05-07T20:32:42.7286126Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7288162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7290079Z 2025-05-07T20:32:42.7290205Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7290432Z 2025-05-07T20:32:42.7290539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7290966Z self=, 2025-05-07T20:32:42.7291371Z T=16384, 2025-05-07T20:32:42.7291578Z D=7168, 2025-05-07T20:32:42.7291782Z scale_ub=None, 2025-05-07T20:32:42.7292001Z contiguous=True, 2025-05-07T20:32:42.7292238Z compiled=False, 2025-05-07T20:32:42.7292468Z ) 2025-05-07T20:32:42.7292788Z self = 2025-05-07T20:32:42.7293298Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7293574Z 2025-05-07T20:32:42.7293663Z @given( 2025-05-07T20:32:42.7293904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7294230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7294549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7294883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7295231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7295534Z ) 2025-05-07T20:32:42.7295897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7296343Z def test_silu_mul_quant( 2025-05-07T20:32:42.7296605Z self, 2025-05-07T20:32:42.7296821Z T: int, 2025-05-07T20:32:42.7297026Z D: int, 2025-05-07T20:32:42.7297256Z scale_ub: Optional[float], 2025-05-07T20:32:42.7297541Z contiguous: bool, 2025-05-07T20:32:42.7297789Z compiled: bool, 2025-05-07T20:32:42.7298024Z ) -> None: 2025-05-07T20:32:42.7298250Z torch.manual_seed(2025) 2025-05-07T20:32:42.7298504Z 2025-05-07T20:32:42.7298839Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7300877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7302857Z 2025-05-07T20:32:42.7302980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7303196Z 2025-05-07T20:32:42.7303308Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7303721Z self=, 2025-05-07T20:32:42.7304144Z T=16384, 2025-05-07T20:32:42.7304349Z D=7168, 2025-05-07T20:32:42.7304546Z scale_ub=1200.0, 2025-05-07T20:32:42.7304779Z contiguous=True, 2025-05-07T20:32:42.7305014Z compiled=False, 2025-05-07T20:32:42.7305224Z ) 2025-05-07T20:32:42.7305549Z self = 2025-05-07T20:32:42.7306145Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7306430Z 2025-05-07T20:32:42.7306520Z @given( 2025-05-07T20:32:42.7306755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7307086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7307406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7307741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7308080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7308377Z ) 2025-05-07T20:32:42.7308733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7309231Z def test_silu_mul_quant( 2025-05-07T20:32:42.7309487Z self, 2025-05-07T20:32:42.7309689Z T: int, 2025-05-07T20:32:42.7309900Z D: int, 2025-05-07T20:32:42.7310130Z scale_ub: Optional[float], 2025-05-07T20:32:42.7310413Z contiguous: bool, 2025-05-07T20:32:42.7310667Z compiled: bool, 2025-05-07T20:32:42.7310901Z ) -> None: 2025-05-07T20:32:42.7311130Z torch.manual_seed(2025) 2025-05-07T20:32:42.7311381Z 2025-05-07T20:32:42.7311667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7313700Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
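[Editor's note] The "allocated by PyTorch" figure climbs across these trials (21.73 GiB here, then 21.74 GiB and 21.77 GiB further down), so tensors from earlier Hypothesis examples appear to outlive their trial and starve the later ones. A hypothetical mitigation, not part of activation_test.py as shown, is to release cached blocks between examples via a tearDown hook:

import gc
import unittest
import torch

class ActivationTests(unittest.TestCase):  # sketch; mirrors the class name in this log
    def tearDown(self) -> None:  # hypothetical addition
        gc.collect()              # drop Python references left by the previous example
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
        super().tearDown()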
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7315540Z 2025-05-07T20:32:42.7315668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7315890Z 2025-05-07T20:32:42.7316001Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7316424Z self=, 2025-05-07T20:32:42.7316837Z T=128, 2025-05-07T20:32:42.7317038Z D=5120, 2025-05-07T20:32:42.7317237Z scale_ub=1200.0, 2025-05-07T20:32:42.7317477Z contiguous=False, 2025-05-07T20:32:42.7317715Z compiled=False, 2025-05-07T20:32:42.7317929Z ) 2025-05-07T20:32:42.8947851Z self = 2025-05-07T20:32:42.8948410Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.8948977Z 2025-05-07T20:32:42.8949064Z @given( 2025-05-07T20:32:42.8949307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.8949632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.8949947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.8950306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.8950676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.8950993Z ) 2025-05-07T20:32:42.8951359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.8951814Z def test_silu_mul_quant( 2025-05-07T20:32:42.8952064Z self, 2025-05-07T20:32:42.8952273Z T: int, 2025-05-07T20:32:42.8952484Z D: int, 2025-05-07T20:32:42.8952713Z scale_ub: Optional[float], 2025-05-07T20:32:42.8952991Z contiguous: bool, 2025-05-07T20:32:42.8953244Z compiled: bool, 2025-05-07T20:32:42.8953480Z ) -> None: 2025-05-07T20:32:42.8953717Z torch.manual_seed(2025) 2025-05-07T20:32:42.8953981Z 2025-05-07T20:32:42.8954262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.8954609Z 2025-05-07T20:32:42.8954815Z x_sign = torch.sign(x) 2025-05-07T20:32:42.8955208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.8955594Z x = x_sign * x_clamp 2025-05-07T20:32:42.8955849Z x0 = x[:, :D] 2025-05-07T20:32:42.8956079Z x1 = x[:, D:] 2025-05-07T20:32:42.8956290Z 2025-05-07T20:32:42.8956489Z if contiguous: 2025-05-07T20:32:42.8956738Z x0 = x0.contiguous() 2025-05-07T20:32:42.8957003Z x1 = x1.contiguous() 2025-05-07T20:32:42.8957254Z 2025-05-07T20:32:42.8957457Z if scale_ub is not None: 2025-05-07T20:32:42.8957738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.8958086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.8958494Z ) 2025-05-07T20:32:42.8958700Z else: 2025-05-07T20:32:42.8958914Z scale_ub_tensor = None 2025-05-07T20:32:42.8959179Z 2025-05-07T20:32:42.8959419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.8959740Z op = silu_mul_quant 2025-05-07T20:32:42.8960006Z if compiled: 2025-05-07T20:32:42.8960267Z op = torch.compile(op) 2025-05-07T20:32:42.8960572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.8960864Z 2025-05-07T20:32:42.8961066Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.8961235Z 2025-05-07T20:32:42.8961340Z moe/activation_test.py:117: 2025-05-07T20:32:42.8961643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.8961989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.8962281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.8962985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.8963689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.8964238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.8964927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.8965602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.8966149Z kernel = self.compile( 2025-05-07T20:32:42.8966704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.8967366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.8967779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.8968017Z 2025-05-07T20:32:42.8968286Z self = 2025-05-07T20:32:42.8969390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.8970824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8a46ca0>} 2025-05-07T20:32:42.8972181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.8973211Z context = 2025-05-07T20:32:42.8973506Z 2025-05-07T20:32:42.8973687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.8974219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.8974700Z module_map=module_map) 2025-05-07T20:32:42.8975081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.8975453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.8975771Z E ^ 2025-05-07T20:32:42.8976288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.8976741Z 2025-05-07T20:32:42.8977165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.8977678Z 2025-05-07T20:32:42.8977786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.8978211Z self=, 2025-05-07T20:32:42.8978625Z T=2048, 2025-05-07T20:32:42.8978825Z D=7168, 2025-05-07T20:32:42.8979023Z scale_ub=None, 2025-05-07T20:32:42.8979297Z contiguous=False, 2025-05-07T20:32:42.8979539Z compiled=False, 2025-05-07T20:32:42.8979751Z ) 2025-05-07T20:32:42.8980079Z self = 2025-05-07T20:32:42.8980588Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.8980895Z 2025-05-07T20:32:42.8980990Z @given( 2025-05-07T20:32:42.8981364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.8981695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.8982010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.8982349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.8982688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.8982984Z ) 2025-05-07T20:32:42.8983338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.8983799Z def test_silu_mul_quant( 2025-05-07T20:32:42.8984059Z self, 2025-05-07T20:32:42.8984258Z T: int, 2025-05-07T20:32:42.8984470Z D: int, 2025-05-07T20:32:42.8984699Z scale_ub: Optional[float], 2025-05-07T20:32:42.8984977Z contiguous: bool, 2025-05-07T20:32:42.8985227Z compiled: bool, 2025-05-07T20:32:42.8985460Z ) -> None: 2025-05-07T20:32:42.8985685Z torch.manual_seed(2025) 2025-05-07T20:32:42.8985940Z 2025-05-07T20:32:42.8986225Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.8988320Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
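[Editor's note] The CompilationError just above is an architecture gap rather than a flake: Triton's fp8e4nv type (float8_e4m3fn) requires compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge.nvidia.gpu runner reports (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard, assuming unittest-style tests; the helper name and example class are hypothetical, not FBGEMM's actual gating:

import unittest
import torch

def _supports_fp8e4nv() -> bool:  # hypothetical helper, not part of activation_test.py
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardExample(unittest.TestCase):  # illustrative stand-in for ActivationTests
    @unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv needs sm_89+ (Ada/Hopper)")
    def test_fp8_path(self) -> None:
        pass  # the fp8 kernel under test would run here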
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.8990200Z 2025-05-07T20:32:42.8990331Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.8990552Z 2025-05-07T20:32:42.8990658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.8991087Z self=, 2025-05-07T20:32:42.8991501Z T=128, 2025-05-07T20:32:42.8991695Z D=7168, 2025-05-07T20:32:42.8991899Z scale_ub=1200.0, 2025-05-07T20:32:42.8992135Z contiguous=True, 2025-05-07T20:32:42.8992360Z compiled=True, 2025-05-07T20:32:42.8992575Z ) 2025-05-07T20:32:42.9437860Z self = 2025-05-07T20:32:42.9447550Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.9447898Z 2025-05-07T20:32:42.9447997Z @given( 2025-05-07T20:32:42.9448238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9448593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9448921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9449255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9449594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9450149Z ) 2025-05-07T20:32:42.9450580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9451042Z def test_silu_mul_quant( 2025-05-07T20:32:42.9451301Z self, 2025-05-07T20:32:42.9451501Z T: int, 2025-05-07T20:32:42.9451715Z D: int, 2025-05-07T20:32:42.9451952Z scale_ub: Optional[float], 2025-05-07T20:32:42.9452229Z contiguous: bool, 2025-05-07T20:32:42.9452484Z compiled: bool, 2025-05-07T20:32:42.9452727Z ) -> None: 2025-05-07T20:32:42.9452951Z torch.manual_seed(2025) 2025-05-07T20:32:42.9453208Z 2025-05-07T20:32:42.9453503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9453948Z 2025-05-07T20:32:42.9454152Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9454459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9454785Z x = x_sign * x_clamp 2025-05-07T20:32:42.9455031Z x0 = x[:, :D] 2025-05-07T20:32:42.9455268Z x1 = x[:, D:] 2025-05-07T20:32:42.9455495Z 2025-05-07T20:32:42.9455687Z if contiguous: 2025-05-07T20:32:42.9455938Z x0 = x0.contiguous() 2025-05-07T20:32:42.9456214Z x1 = x1.contiguous() 2025-05-07T20:32:42.9456461Z 2025-05-07T20:32:42.9456668Z if scale_ub is not None: 2025-05-07T20:32:42.9456956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9457300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9457624Z ) 2025-05-07T20:32:42.9457831Z else: 2025-05-07T20:32:42.9458048Z scale_ub_tensor = None 2025-05-07T20:32:42.9458323Z 2025-05-07T20:32:42.9458568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9458897Z op = silu_mul_quant 2025-05-07T20:32:42.9459159Z if compiled: 2025-05-07T20:32:42.9459422Z op = torch.compile(op) 2025-05-07T20:32:42.9459742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9460025Z 2025-05-07T20:32:42.9460232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9460403Z 2025-05-07T20:32:42.9460520Z moe/activation_test.py:117: 2025-05-07T20:32:42.9460857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9461300Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9461634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9462211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.9462784Z return fn(*args, **kwargs) 2025-05-07T20:32:42.9463528Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.9464241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9464794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9465486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9466162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9466702Z kernel = self.compile( 2025-05-07T20:32:42.9467253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9467907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9468314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9468552Z 2025-05-07T20:32:42.9468777Z self = 2025-05-07T20:32:42.9469909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9471398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd89350d0>} 2025-05-07T20:32:42.9472743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9473768Z context = 2025-05-07T20:32:42.9474061Z 2025-05-07T20:32:42.9474246Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9474821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9475293Z module_map=module_map) 2025-05-07T20:32:42.9475679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9476059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9476330Z E ^ 2025-05-07T20:32:42.9476804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9477261Z 2025-05-07T20:32:42.9477695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9478218Z 2025-05-07T20:32:42.9478326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9478748Z self=, 2025-05-07T20:32:42.9479163Z T=128, 2025-05-07T20:32:42.9479369Z D=7168, 2025-05-07T20:32:42.9479568Z scale_ub=1200.0, 2025-05-07T20:32:42.9479805Z contiguous=True, 2025-05-07T20:32:42.9480043Z compiled=False, 2025-05-07T20:32:42.9480256Z ) 2025-05-07T20:32:42.9480587Z self = 2025-05-07T20:32:42.9481098Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.9481380Z 2025-05-07T20:32:42.9481463Z @given( 2025-05-07T20:32:42.9481707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9482034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9482348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9482688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9483030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9483332Z ) 2025-05-07T20:32:42.9483692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9484206Z def test_silu_mul_quant( 2025-05-07T20:32:42.9484463Z self, 2025-05-07T20:32:42.9484663Z T: int, 2025-05-07T20:32:42.9484877Z D: int, 2025-05-07T20:32:42.9485107Z scale_ub: Optional[float], 2025-05-07T20:32:42.9485384Z contiguous: bool, 2025-05-07T20:32:42.9485642Z compiled: bool, 2025-05-07T20:32:42.9485885Z ) -> None: 2025-05-07T20:32:42.9486106Z torch.manual_seed(2025) 2025-05-07T20:32:42.9486359Z 2025-05-07T20:32:42.9486645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9486995Z 2025-05-07T20:32:42.9487200Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9487501Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9489496Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.9491495Z 2025-05-07T20:32:42.9491629Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.9491854Z 2025-05-07T20:32:42.9491960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9492398Z self=, 2025-05-07T20:32:42.9492808Z T=128, 2025-05-07T20:32:42.9492997Z D=5120, 2025-05-07T20:32:42.9493198Z scale_ub=1200.0, 2025-05-07T20:32:42.9493431Z contiguous=True, 2025-05-07T20:32:42.9493654Z compiled=True, 2025-05-07T20:32:42.9493867Z ) 2025-05-07T20:32:42.9494197Z self = 2025-05-07T20:32:42.9494752Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.9495026Z 2025-05-07T20:32:42.9495108Z @given( 2025-05-07T20:32:42.9495347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9495670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9495986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9496329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9496672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9496961Z ) 2025-05-07T20:32:42.9497326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9497775Z def test_silu_mul_quant( 2025-05-07T20:32:42.9498027Z self, 2025-05-07T20:32:42.9498222Z T: int, 2025-05-07T20:32:42.9498425Z D: int, 2025-05-07T20:32:42.9498650Z scale_ub: Optional[float], 2025-05-07T20:32:42.9498924Z contiguous: bool, 2025-05-07T20:32:42.9499179Z compiled: bool, 2025-05-07T20:32:42.9499411Z ) -> None: 2025-05-07T20:32:42.9499628Z torch.manual_seed(2025) 2025-05-07T20:32:42.9499880Z 2025-05-07T20:32:42.9500162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9500520Z 2025-05-07T20:32:42.9500761Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9501215Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9503270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.9505130Z 2025-05-07T20:32:42.9505257Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.9505475Z 2025-05-07T20:32:42.9505581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9506000Z self=, 2025-05-07T20:32:42.9506420Z T=128, 2025-05-07T20:32:42.9506609Z D=7168, 2025-05-07T20:32:42.9506810Z scale_ub=None, 2025-05-07T20:32:42.9507030Z contiguous=True, 2025-05-07T20:32:42.9507255Z compiled=True, 2025-05-07T20:32:42.9507469Z ) 2025-05-07T20:32:43.1686540Z self = 2025-05-07T20:32:43.1687093Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1687366Z 2025-05-07T20:32:43.1687449Z @given( 2025-05-07T20:32:43.1687689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1688034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1688359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1688705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1689047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1689344Z ) 2025-05-07T20:32:43.1689982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1690749Z def test_silu_mul_quant( 2025-05-07T20:32:43.1691121Z self, 2025-05-07T20:32:43.1691390Z T: int, 2025-05-07T20:32:43.1691676Z D: int, 2025-05-07T20:32:43.1691987Z scale_ub: Optional[float], 2025-05-07T20:32:43.1692357Z contiguous: bool, 2025-05-07T20:32:43.1692680Z compiled: bool, 2025-05-07T20:32:43.1692981Z ) -> None: 2025-05-07T20:32:43.1693259Z torch.manual_seed(2025) 2025-05-07T20:32:43.1693582Z 2025-05-07T20:32:43.1693867Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1696077Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1697958Z 2025-05-07T20:32:43.1698087Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.1698303Z 2025-05-07T20:32:43.1707891Z FAILED 2025-05-07T20:32:43.1708162Z 2025-05-07T20:32:43.1708554Z =================================== FAILURES =================================== 2025-05-07T20:32:43.1709284Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:43.1709981Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:43.1710872Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:43.1711644Z | yield 2025-05-07T20:32:43.1712325Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:43.1713086Z | self._callTestMethod(testMethod) 2025-05-07T20:32:43.1713717Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:43.1714280Z | method() 2025-05-07T20:32:43.1714969Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:43.1715846Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1716973Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:43.1717883Z | raise the_error_hypothesis_found 2025-05-07T20:32:43.1718633Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:43.1719332Z +-+---------------- 1 ---------------- 2025-05-07T20:32:43.1719789Z | Traceback (most recent call last): 2025-05-07T20:32:43.1720857Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1721995Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1724498Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1727304Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1728094Z | self=, 2025-05-07T20:32:43.1728674Z | T=2048, 2025-05-07T20:32:43.1729005Z | D=5120, # or any other generated value 2025-05-07T20:32:43.1729482Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:43.1729994Z | contiguous=True, # or any other generated value 2025-05-07T20:32:43.1730520Z | compiled=False, # or any other generated value 2025-05-07T20:32:43.1730945Z | ) 2025-05-07T20:32:43.1731195Z | 2025-05-07T20:32:43.1731951Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:43.1732912Z +---------------- 2 ---------------- 2025-05-07T20:32:43.1733356Z | Traceback (most recent call last): 2025-05-07T20:32:43.1734358Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1735448Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1738313Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1741602Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1742273Z | self=, 2025-05-07T20:32:43.1742859Z | T=128, 2025-05-07T20:32:43.1743154Z | D=7168, 2025-05-07T20:32:43.1743452Z | scale_ub=None, 2025-05-07T20:32:43.1743800Z | contiguous=True, 2025-05-07T20:32:43.1744187Z | compiled=True, 2025-05-07T20:32:43.1744500Z | ) 2025-05-07T20:32:43.1744786Z | 2025-05-07T20:32:43.1745527Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1746243Z +---------------- 3 ---------------- 2025-05-07T20:32:43.1746543Z | Traceback (most recent call last): 2025-05-07T20:32:43.1747371Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1748165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1750190Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1752240Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1752688Z | self=, 2025-05-07T20:32:43.1753102Z | T=128, 2025-05-07T20:32:43.1753324Z | D=5120, 2025-05-07T20:32:43.1753540Z | scale_ub=1200.0, 2025-05-07T20:32:43.1753794Z | contiguous=True, 2025-05-07T20:32:43.1754101Z | compiled=True, 2025-05-07T20:32:43.1754403Z | ) 2025-05-07T20:32:43.1754662Z | 2025-05-07T20:32:43.1755534Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1756497Z +---------------- 4 ---------------- 2025-05-07T20:32:43.1756924Z | Traceback (most recent call last): 2025-05-07T20:32:43.1757961Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:43.1759021Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1759954Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:43.1761110Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1762299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:43.1763434Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1764334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:43.1765394Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1766449Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:43.1767524Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1768672Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:43.1769795Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1770893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:43.1771877Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1772797Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:43.1773596Z | fn() 2025-05-07T20:32:43.1774354Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:43.1775219Z | self.fn.run( 2025-05-07T20:32:43.1776054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:43.1776869Z | kernel = self.compile( 2025-05-07T20:32:43.1777726Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:43.1778745Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1779746Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:43.1780854Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1781716Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1782209Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1782585Z | ^ 2025-05-07T20:32:43.1783223Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1784036Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1784611Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:43.1785345Z | self=, 2025-05-07T20:32:43.1785967Z | T=1, # or any other generated value 2025-05-07T20:32:43.1786541Z | D=5120, # or any other generated value 2025-05-07T20:32:43.1787024Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:43.1787534Z | contiguous=True, # or any other generated value 2025-05-07T20:32:43.1788043Z | compiled=True, # or any other generated value 2025-05-07T20:32:43.1788473Z | ) 2025-05-07T20:32:43.1788731Z | 2025-05-07T20:32:43.1789484Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1790355Z +------------------------------------ 2025-05-07T20:32:43.1790916Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:43.1791427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1792013Z self=, 2025-05-07T20:32:43.1792580Z T=1, 2025-05-07T20:32:43.1792842Z D=5120, 2025-05-07T20:32:43.1793125Z scale_ub=None, 2025-05-07T20:32:43.1793436Z contiguous=True, 2025-05-07T20:32:43.1793753Z compiled=True, 2025-05-07T20:32:43.1794054Z ) 2025-05-07T20:32:43.1794507Z self = 2025-05-07T20:32:43.1795188Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1795561Z 2025-05-07T20:32:43.1795673Z @given( 2025-05-07T20:32:43.1796000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1796440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1796863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1797340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1797797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1798174Z ) 2025-05-07T20:32:43.1798645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1799278Z def test_silu_mul_quant( 2025-05-07T20:32:43.1799628Z self, 2025-05-07T20:32:43.1799904Z T: int, 2025-05-07T20:32:43.1800191Z D: int, 2025-05-07T20:32:43.1800511Z scale_ub: Optional[float], 2025-05-07T20:32:43.1800896Z contiguous: bool, 2025-05-07T20:32:43.1801241Z compiled: bool, 2025-05-07T20:32:43.1801553Z ) -> None: 2025-05-07T20:32:43.1801855Z torch.manual_seed(2025) 2025-05-07T20:32:43.1802214Z 2025-05-07T20:32:43.1802614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1803081Z 2025-05-07T20:32:43.1803350Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1803818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1804261Z x = x_sign * x_clamp 2025-05-07T20:32:43.1804619Z x0 = x[:, :D] 2025-05-07T20:32:43.1804934Z x1 = x[:, D:] 2025-05-07T20:32:43.1805226Z 2025-05-07T20:32:43.1805494Z if contiguous: 2025-05-07T20:32:43.1805836Z x0 = x0.contiguous() 
2025-05-07T20:32:43.1806197Z x1 = x1.contiguous() 2025-05-07T20:32:43.1806546Z 2025-05-07T20:32:43.1806821Z if scale_ub is not None: 2025-05-07T20:32:43.1807231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1807695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1808133Z ) 2025-05-07T20:32:43.1808401Z else: 2025-05-07T20:32:43.1830406Z scale_ub_tensor = None 2025-05-07T20:32:43.1830825Z 2025-05-07T20:32:43.1831155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1831570Z op = silu_mul_quant 2025-05-07T20:32:43.1831909Z if compiled: 2025-05-07T20:32:43.1832238Z op = torch.compile(op) 2025-05-07T20:32:43.1832641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1833002Z 2025-05-07T20:32:43.1833262Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1833744Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1834186Z 2025-05-07T20:32:43.1834513Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1834966Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1835375Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1835836Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1836354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1836807Z 2025-05-07T20:32:43.1837094Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1837380Z 2025-05-07T20:32:43.1837526Z moe/activation_test.py:126: 2025-05-07T20:32:43.1838018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1838491Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1838954Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1840456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1841603Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1842380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1843331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1844295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1845333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1846349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.1847363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1848356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1849241Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1850101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1850847Z fn() 2025-05-07T20:32:43.1851568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1852402Z self.fn.run( 2025-05-07T20:32:43.1853067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1853996Z kernel = self.compile( 2025-05-07T20:32:43.1855574Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1856500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1857046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1857363Z 2025-05-07T20:32:43.1857654Z self = 2025-05-07T20:32:43.1859090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1860958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdcfd99d0>} 2025-05-07T20:32:43.1862884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1864315Z context = 2025-05-07T20:32:43.1864820Z 2025-05-07T20:32:43.1865145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1865895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1866558Z module_map=module_map) 2025-05-07T20:32:43.1867056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1867533Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1867895Z E ^ 2025-05-07T20:32:43.1868529Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1869235Z 2025-05-07T20:32:43.1869832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1870579Z 2025-05-07T20:32:43.1870734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1871390Z self=, 2025-05-07T20:32:43.1871973Z T=2048, 2025-05-07T20:32:43.1872237Z D=5120, 2025-05-07T20:32:43.1872525Z scale_ub=1200.0, 2025-05-07T20:32:43.1872859Z contiguous=True, 2025-05-07T20:32:43.1873177Z compiled=False, 2025-05-07T20:32:43.1873483Z ) 2025-05-07T20:32:43.1873946Z self = 2025-05-07T20:32:43.1874639Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.1875032Z 2025-05-07T20:32:43.1875143Z @given( 2025-05-07T20:32:43.1875473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1875932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1876373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1876857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1877340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1877761Z ) 2025-05-07T20:32:43.1878274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1878901Z def test_silu_mul_quant( 2025-05-07T20:32:43.1879242Z self, 2025-05-07T20:32:43.1879520Z T: int, 2025-05-07T20:32:43.1879810Z D: int, 2025-05-07T20:32:43.1880119Z scale_ub: Optional[float], 2025-05-07T20:32:43.1880515Z contiguous: bool, 2025-05-07T20:32:43.1880853Z compiled: bool, 2025-05-07T20:32:43.1881169Z ) -> None: 2025-05-07T20:32:43.1881476Z torch.manual_seed(2025) 2025-05-07T20:32:43.1881811Z 2025-05-07T20:32:43.1882201Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1882734Z 2025-05-07T20:32:43.1883018Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1883434Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1883867Z x = x_sign * x_clamp 2025-05-07T20:32:43.1884219Z x0 = x[:, :D] 
2025-05-07T20:32:43.1884520Z x1 = x[:, D:] 2025-05-07T20:32:43.1884793Z 2025-05-07T20:32:43.1885061Z if contiguous: 2025-05-07T20:32:43.1885392Z x0 = x0.contiguous() 2025-05-07T20:32:43.1885759Z x1 = x1.contiguous() 2025-05-07T20:32:43.1886114Z 2025-05-07T20:32:43.1886405Z if scale_ub is not None: 2025-05-07T20:32:43.1886803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1887290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1887729Z ) 2025-05-07T20:32:43.1888014Z else: 2025-05-07T20:32:43.1888316Z scale_ub_tensor = None 2025-05-07T20:32:43.1888687Z 2025-05-07T20:32:43.1889021Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1889473Z op = silu_mul_quant 2025-05-07T20:32:43.1889829Z if compiled: 2025-05-07T20:32:43.1890184Z op = torch.compile(op) 2025-05-07T20:32:43.1890588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1891054Z 2025-05-07T20:32:43.1891344Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1891566Z 2025-05-07T20:32:43.1891690Z moe/activation_test.py:117: 2025-05-07T20:32:43.1892084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1892503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1892877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1893797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1894782Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1895517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1896507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1897408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1898146Z kernel = self.compile( 2025-05-07T20:32:43.1898867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1899758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1900313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1900624Z 2025-05-07T20:32:43.1900920Z self = 2025-05-07T20:32:43.1902594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1904180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdbfe5e50>} 2025-05-07T20:32:43.1905535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1906556Z context = 2025-05-07T20:32:43.1906847Z 2025-05-07T20:32:43.1907015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1907545Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1908013Z module_map=module_map) 2025-05-07T20:32:43.1908443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1908808Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1909073Z E ^ 2025-05-07T20:32:43.1909541Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdbb7ca60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
at 0x7fbfdbb7ca60>} 2025-05-07T20:32:43.1946580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1947599Z context = 2025-05-07T20:32:43.1947889Z 2025-05-07T20:32:43.1948069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1948602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1949180Z module_map=module_map) 2025-05-07T20:32:43.1949558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1949922Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1950187Z E ^ 2025-05-07T20:32:43.1950652Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1951102Z 2025-05-07T20:32:43.1951520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1952030Z 2025-05-07T20:32:43.1952141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1952554Z self=, 2025-05-07T20:32:43.1952961Z T=16384, 2025-05-07T20:32:43.1953161Z D=7168, 2025-05-07T20:32:43.1953351Z scale_ub=1200.0, 2025-05-07T20:32:43.1953585Z contiguous=False, 2025-05-07T20:32:43.1953819Z compiled=False, 2025-05-07T20:32:43.1954032Z ) 2025-05-07T20:32:43.1954352Z self = 2025-05-07T20:32:43.1954860Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.1955138Z 2025-05-07T20:32:43.1955225Z @given( 2025-05-07T20:32:43.1955591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1955924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1956238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1956569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1956910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1957209Z ) 2025-05-07T20:32:43.1957560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1958009Z def test_silu_mul_quant( 2025-05-07T20:32:43.1958262Z self, 2025-05-07T20:32:43.1958456Z T: int, 2025-05-07T20:32:43.1958734Z D: int, 2025-05-07T20:32:43.1958958Z scale_ub: Optional[float], 2025-05-07T20:32:43.1959228Z contiguous: bool, 2025-05-07T20:32:43.1959474Z compiled: bool, 2025-05-07T20:32:43.1959708Z ) -> None: 2025-05-07T20:32:43.1959931Z torch.manual_seed(2025) 2025-05-07T20:32:43.1960181Z 2025-05-07T20:32:43.1960464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1960814Z 2025-05-07T20:32:43.1961004Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1961355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1961674Z x = x_sign * x_clamp 2025-05-07T20:32:43.1961915Z x0 = x[:, :D] 2025-05-07T20:32:43.1962141Z x1 = x[:, D:] 2025-05-07T20:32:43.1962352Z 2025-05-07T20:32:43.1962537Z if contiguous: 2025-05-07T20:32:43.1962778Z x0 = x0.contiguous() 2025-05-07T20:32:43.1963043Z x1 = x1.contiguous() 2025-05-07T20:32:43.1963290Z 2025-05-07T20:32:43.1963489Z if scale_ub is not None: 2025-05-07T20:32:43.1963772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1964105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1964424Z ) 2025-05-07T20:32:43.1964628Z else: 2025-05-07T20:32:43.1964847Z scale_ub_tensor = None 2025-05-07T20:32:43.1965102Z 2025-05-07T20:32:43.1965345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1965662Z op = silu_mul_quant 2025-05-07T20:32:43.1965914Z if compiled: 
2025-05-07T20:32:43.1966174Z op = torch.compile(op) 2025-05-07T20:32:43.1966479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1966757Z 2025-05-07T20:32:43.1966956Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1967122Z 2025-05-07T20:32:43.1967233Z moe/activation_test.py:117: 2025-05-07T20:32:43.1967584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1967928Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1968217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1968912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1969614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1970164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1970852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1971511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1972055Z kernel = self.compile( 2025-05-07T20:32:43.1972605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1973269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1973667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1973906Z 2025-05-07T20:32:43.1974119Z self = 2025-05-07T20:32:43.1975257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1976670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb9e5670>} 2025-05-07T20:32:43.1978016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1979094Z context = 2025-05-07T20:32:43.1979392Z 2025-05-07T20:32:43.1979563Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1980097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1980575Z module_map=module_map) 2025-05-07T20:32:43.1980997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1981437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1981703Z E ^ 2025-05-07T20:32:43.1982160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1982616Z 2025-05-07T20:32:43.1983031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1983541Z 2025-05-07T20:32:43.1983652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1984067Z self=, 2025-05-07T20:32:43.1984474Z T=1, 2025-05-07T20:32:43.1984666Z D=7168, 2025-05-07T20:32:43.1984865Z scale_ub=None, 2025-05-07T20:32:43.1985079Z contiguous=True, 2025-05-07T20:32:43.1985311Z compiled=True, 2025-05-07T20:32:43.1985525Z ) 2025-05-07T20:32:43.1985845Z self = 2025-05-07T20:32:43.1986330Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1986592Z 2025-05-07T20:32:43.1986677Z @given( 2025-05-07T20:32:43.1986907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1987230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1987546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1987877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1988270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1988573Z ) 2025-05-07T20:32:43.1988929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1989371Z def test_silu_mul_quant( 2025-05-07T20:32:43.1989618Z self, 2025-05-07T20:32:43.1989824Z T: int, 2025-05-07T20:32:43.1990026Z D: int, 2025-05-07T20:32:43.1990250Z scale_ub: Optional[float], 2025-05-07T20:32:43.1990533Z contiguous: bool, 2025-05-07T20:32:43.1990774Z compiled: bool, 2025-05-07T20:32:43.1991004Z ) -> None: 2025-05-07T20:32:43.1991226Z torch.manual_seed(2025) 2025-05-07T20:32:43.1991468Z 2025-05-07T20:32:43.1991747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1992100Z 2025-05-07T20:32:43.1992295Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1992596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1992912Z x = x_sign * x_clamp 2025-05-07T20:32:43.1993158Z x0 = x[:, :D] 2025-05-07T20:32:43.1993377Z x1 = x[:, D:] 2025-05-07T20:32:43.1993588Z 2025-05-07T20:32:43.1993775Z if contiguous: 2025-05-07T20:32:43.1994004Z x0 = x0.contiguous() 2025-05-07T20:32:43.1994267Z x1 = x1.contiguous() 2025-05-07T20:32:43.1994568Z 2025-05-07T20:32:43.1994820Z if scale_ub is not None: 2025-05-07T20:32:43.1995102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1995444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1995751Z ) 2025-05-07T20:32:43.1995947Z else: 2025-05-07T20:32:43.1996162Z scale_ub_tensor = None 2025-05-07T20:32:43.1996412Z 2025-05-07T20:32:43.1996653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1996969Z op = silu_mul_quant 2025-05-07T20:32:43.1997224Z if compiled: 2025-05-07T20:32:43.1997477Z op = torch.compile(op) 2025-05-07T20:32:43.1997837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1998116Z 2025-05-07T20:32:43.1998314Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1998602Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1998897Z 2025-05-07T20:32:43.1999136Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1999481Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1999782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2008407Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2008786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2009113Z 2025-05-07T20:32:43.2009328Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2009529Z 2025-05-07T20:32:43.2009641Z moe/activation_test.py:126: 2025-05-07T20:32:43.2009937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2010297Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2010642Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2011429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2012209Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2012768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2013460Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2014147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2014871Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2015633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2016472Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2017203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2017851Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2018461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2018978Z fn() 2025-05-07T20:32:43.2019498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2020091Z self.fn.run( 2025-05-07T20:32:43.2020563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2021176Z kernel = self.compile( 2025-05-07T20:32:43.2021734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2022392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2022790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2023029Z 2025-05-07T20:32:43.2023288Z self = 2025-05-07T20:32:43.2024109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2024629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfdba28dc0>} 2025-05-07T20:32:43.2025379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2025642Z context = 2025-05-07T20:32:43.2025647Z 2025-05-07T20:32:43.2025817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2026096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2026207Z module_map=module_map) 2025-05-07T20:32:43.2026374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2026488Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2026568Z E ^ 2025-05-07T20:32:43.2026928Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2026932Z 2025-05-07T20:32:43.2027355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2027362Z 2025-05-07T20:32:43.2027467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2027706Z self=, 2025-05-07T20:32:43.2027785Z T=4096, 2025-05-07T20:32:43.2027872Z D=5120, 2025-05-07T20:32:43.2027958Z scale_ub=None, 2025-05-07T20:32:43.2028045Z contiguous=False, 2025-05-07T20:32:43.2028131Z compiled=False, 2025-05-07T20:32:43.2028203Z ) 2025-05-07T20:32:43.2028416Z self = 2025-05-07T20:32:43.2028591Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2028595Z 2025-05-07T20:32:43.2028671Z @given( 2025-05-07T20:32:43.2028791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2028896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2029011Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2029183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2029304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2029381Z ) 2025-05-07T20:32:43.2029640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2029740Z def test_silu_mul_quant( 2025-05-07T20:32:43.2029827Z self, 2025-05-07T20:32:43.2029914Z T: int, 2025-05-07T20:32:43.2029994Z D: int, 2025-05-07T20:32:43.2030094Z scale_ub: Optional[float], 2025-05-07T20:32:43.2030192Z contiguous: bool, 2025-05-07T20:32:43.2030281Z compiled: bool, 2025-05-07T20:32:43.2030369Z ) -> None: 2025-05-07T20:32:43.2030475Z torch.manual_seed(2025) 2025-05-07T20:32:43.2030550Z 2025-05-07T20:32:43.2030728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2030805Z 2025-05-07T20:32:43.2030899Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2031040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2031134Z x = x_sign * x_clamp 2025-05-07T20:32:43.2031220Z x0 = x[:, :D] 2025-05-07T20:32:43.2031310Z x1 = x[:, D:] 2025-05-07T20:32:43.2031387Z 2025-05-07T20:32:43.2031472Z if contiguous: 2025-05-07T20:32:43.2031576Z x0 = x0.contiguous() 2025-05-07T20:32:43.2031748Z x1 = x1.contiguous() 2025-05-07T20:32:43.2031824Z 2025-05-07T20:32:43.2031924Z if scale_ub is not None: 2025-05-07T20:32:43.2032031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2032169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2032256Z ) 2025-05-07T20:32:43.2032336Z else: 2025-05-07T20:32:43.2032445Z scale_ub_tensor = None 2025-05-07T20:32:43.2032520Z 2025-05-07T20:32:43.2032650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2032752Z op = silu_mul_quant 2025-05-07T20:32:43.2032883Z if compiled: 
2025-05-07T20:32:43.2032991Z op = torch.compile(op) 2025-05-07T20:32:43.2033112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2033188Z 2025-05-07T20:32:43.2033283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2033288Z 2025-05-07T20:32:43.2033395Z moe/activation_test.py:117: 2025-05-07T20:32:43.2033533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2033644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2033747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2034253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2034360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2034725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2034956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2035309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2035406Z kernel = self.compile( 2025-05-07T20:32:43.2035795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2035978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2036106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2036110Z 2025-05-07T20:32:43.2036327Z self = 2025-05-07T20:32:43.2037095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2037659Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb6393a0>} 2025-05-07T20:32:43.2038415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2038614Z context = 2025-05-07T20:32:43.2038625Z 2025-05-07T20:32:43.2038795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2039058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2039173Z module_map=module_map) 2025-05-07T20:32:43.2039341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2039443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2039535Z E ^ 2025-05-07T20:32:43.2039893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2039898Z 2025-05-07T20:32:43.2040591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2040740Z 2025-05-07T20:32:43.2040854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2041081Z self=, 2025-05-07T20:32:43.2041168Z T=4096, 2025-05-07T20:32:43.2041244Z D=7168, 2025-05-07T20:32:43.2041327Z scale_ub=None, 2025-05-07T20:32:43.2041420Z contiguous=False, 2025-05-07T20:32:43.2041506Z compiled=False, 2025-05-07T20:32:43.2041581Z ) 2025-05-07T20:32:43.2041805Z self = 2025-05-07T20:32:43.2041981Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2042050Z 2025-05-07T20:32:43.2042137Z @given( 2025-05-07T20:32:43.2042264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2042365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2042489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2042614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2042733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2042817Z ) 2025-05-07T20:32:43.2043066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2043171Z def test_silu_mul_quant( 2025-05-07T20:32:43.2043249Z self, 2025-05-07T20:32:43.2043328Z T: int, 2025-05-07T20:32:43.2043412Z D: int, 2025-05-07T20:32:43.2043511Z scale_ub: Optional[float], 2025-05-07T20:32:43.2043602Z contiguous: bool, 2025-05-07T20:32:43.2043696Z compiled: bool, 2025-05-07T20:32:43.2043778Z ) -> None: 2025-05-07T20:32:43.2043882Z torch.manual_seed(2025) 2025-05-07T20:32:43.2043968Z 2025-05-07T20:32:43.2044141Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2044219Z 2025-05-07T20:32:43.2044320Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2044451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2044545Z x = x_sign * x_clamp 2025-05-07T20:32:43.2044634Z x0 = x[:, :D] 2025-05-07T20:32:43.2044719Z x1 = x[:, D:] 2025-05-07T20:32:43.2044804Z 2025-05-07T20:32:43.2044890Z if contiguous: 2025-05-07T20:32:43.2044984Z x0 = x0.contiguous() 2025-05-07T20:32:43.2045082Z x1 = x1.contiguous() 2025-05-07T20:32:43.2045156Z 2025-05-07T20:32:43.2045249Z if scale_ub is not None: 2025-05-07T20:32:43.2045364Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2045502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2045645Z ) 2025-05-07T20:32:43.2045734Z else: 2025-05-07T20:32:43.2045830Z scale_ub_tensor = None 2025-05-07T20:32:43.2045905Z 2025-05-07T20:32:43.2046046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2046140Z op = silu_mul_quant 2025-05-07T20:32:43.2046240Z if compiled: 2025-05-07T20:32:43.2046343Z op = torch.compile(op) 2025-05-07T20:32:43.2046450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2046535Z 2025-05-07T20:32:43.2046627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2046631Z 2025-05-07T20:32:43.2046732Z moe/activation_test.py:117: 2025-05-07T20:32:43.2046867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2046974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2047076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2047583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2047685Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2048049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2048343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2048725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2048829Z kernel = self.compile( 2025-05-07T20:32:43.2049205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2049386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2049513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2049517Z 2025-05-07T20:32:43.2049731Z self = 2025-05-07T20:32:43.2050536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2051056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb5a78b0>} 2025-05-07T20:32:43.2051812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2052007Z context = 2025-05-07T20:32:43.2052012Z 2025-05-07T20:32:43.2052188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2052461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2052571Z module_map=module_map) 2025-05-07T20:32:43.2052740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2052839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2052928Z E ^ 2025-05-07T20:32:43.2053278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2053283Z 2025-05-07T20:32:43.2053698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2053702Z 2025-05-07T20:32:43.2053811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2054034Z self=, 2025-05-07T20:32:43.2054116Z T=128, 2025-05-07T20:32:43.2054193Z D=7168, 2025-05-07T20:32:43.2054276Z scale_ub=None, 2025-05-07T20:32:43.2054415Z contiguous=False, 2025-05-07T20:32:43.2054503Z compiled=True, 2025-05-07T20:32:43.2054580Z ) 2025-05-07T20:32:43.2054801Z self = 2025-05-07T20:32:43.2054971Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2054982Z 2025-05-07T20:32:43.2055059Z @given( 2025-05-07T20:32:43.2055183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2055282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2055406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2055523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2055640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2055724Z ) 2025-05-07T20:32:43.2055971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2056067Z def test_silu_mul_quant( 2025-05-07T20:32:43.2056155Z self, 2025-05-07T20:32:43.2056232Z T: int, 2025-05-07T20:32:43.2056308Z D: int, 2025-05-07T20:32:43.2056415Z scale_ub: Optional[float], 2025-05-07T20:32:43.2056504Z contiguous: bool, 2025-05-07T20:32:43.2056591Z compiled: bool, 2025-05-07T20:32:43.2056676Z ) -> None: 2025-05-07T20:32:43.2056851Z torch.manual_seed(2025) 2025-05-07T20:32:43.2056930Z 2025-05-07T20:32:43.2057098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2057172Z 2025-05-07T20:32:43.2057270Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2057396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2057487Z x = x_sign * x_clamp 2025-05-07T20:32:43.2057577Z x0 = x[:, :D] 2025-05-07T20:32:43.2057657Z x1 = x[:, D:] 2025-05-07T20:32:43.2057731Z 2025-05-07T20:32:43.2057821Z if contiguous: 2025-05-07T20:32:43.2057912Z x0 = x0.contiguous() 2025-05-07T20:32:43.2058049Z x1 = x1.contiguous() 2025-05-07T20:32:43.2058127Z 2025-05-07T20:32:43.2058219Z if scale_ub is not None: 2025-05-07T20:32:43.2058326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2058468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2058549Z ) 2025-05-07T20:32:43.2058634Z else: 2025-05-07T20:32:43.2058728Z scale_ub_tensor = None 2025-05-07T20:32:43.2058800Z 2025-05-07T20:32:43.2058937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2059028Z op = silu_mul_quant 2025-05-07T20:32:43.2059115Z if compiled: 2025-05-07T20:32:43.2059222Z op = torch.compile(op) 2025-05-07T20:32:43.2059329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2059402Z 2025-05-07T20:32:43.2059500Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2059620Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2059700Z 2025-05-07T20:32:43.2059841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2059945Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2060050Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2060169Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2060315Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2060398Z 2025-05-07T20:32:43.2060497Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2060501Z 2025-05-07T20:32:43.2060623Z moe/activation_test.py:126: 2025-05-07T20:32:43.2060772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2060888Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2061029Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2061685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2061791Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2062155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2062382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2062753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2063015Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2063416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2063674Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2064046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2064214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2064557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2064634Z fn() 2025-05-07T20:32:43.2065110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2065194Z self.fn.run( 2025-05-07T20:32:43.2065528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2065628Z kernel = self.compile( 2025-05-07T20:32:43.2066010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2066188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2066325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2066368Z 2025-05-07T20:32:43.2066576Z self = 2025-05-07T20:32:43.2067352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2067858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfdb5a7e50>} 2025-05-07T20:32:43.2068600Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2068795Z context = 2025-05-07T20:32:43.2068799Z 2025-05-07T20:32:43.2068973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2069246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2069355Z module_map=module_map) 2025-05-07T20:32:43.2069521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2069637Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2069715Z E ^ 2025-05-07T20:32:43.2070070Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2070075Z 2025-05-07T20:32:43.2070484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2070489Z 2025-05-07T20:32:43.2070590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2070817Z self=, 2025-05-07T20:32:43.2070940Z T=128, 2025-05-07T20:32:43.2071025Z D=7168, 2025-05-07T20:32:43.2071108Z scale_ub=None, 2025-05-07T20:32:43.2071195Z contiguous=False, 2025-05-07T20:32:43.2071284Z compiled=False, 2025-05-07T20:32:43.2071357Z ) 2025-05-07T20:32:43.2071573Z self = 2025-05-07T20:32:43.2071754Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2071758Z 2025-05-07T20:32:43.2071835Z @given( 2025-05-07T20:32:43.2071956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2072060Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2072174Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2072293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2072408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2072484Z ) 2025-05-07T20:32:43.2072737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2072835Z def test_silu_mul_quant( 2025-05-07T20:32:43.2072913Z self, 2025-05-07T20:32:43.2072996Z T: int, 2025-05-07T20:32:43.2073073Z D: int, 2025-05-07T20:32:43.2073171Z scale_ub: Optional[float], 2025-05-07T20:32:43.2073268Z contiguous: bool, 2025-05-07T20:32:43.2073435Z compiled: bool, 2025-05-07T20:32:43.2073516Z ) -> None: 2025-05-07T20:32:43.2073617Z torch.manual_seed(2025) 2025-05-07T20:32:43.2073691Z 2025-05-07T20:32:43.2073863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2073937Z 2025-05-07T20:32:43.2074027Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2074158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2074248Z x = x_sign * x_clamp 2025-05-07T20:32:43.2074329Z x0 = x[:, :D] 2025-05-07T20:32:43.2074418Z x1 = x[:, D:] 2025-05-07T20:32:43.2074494Z 2025-05-07T20:32:43.2074633Z if contiguous: 2025-05-07T20:32:43.2074739Z x0 = x0.contiguous() 2025-05-07T20:32:43.2074831Z x1 = x1.contiguous() 2025-05-07T20:32:43.2074902Z 2025-05-07T20:32:43.2074998Z if scale_ub is not None: 2025-05-07T20:32:43.2075104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2075249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2075331Z ) 2025-05-07T20:32:43.2075411Z else: 2025-05-07T20:32:43.2075505Z scale_ub_tensor = None 2025-05-07T20:32:43.2075583Z 2025-05-07T20:32:43.2075716Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2075806Z op = silu_mul_quant 2025-05-07T20:32:43.2075897Z if compiled: 
2025-05-07T20:32:43.2075998Z op = torch.compile(op) 2025-05-07T20:32:43.2076112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2076185Z 2025-05-07T20:32:43.2076283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2076288Z 2025-05-07T20:32:43.2076393Z moe/activation_test.py:117: 2025-05-07T20:32:43.2076521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2076623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2076729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2077238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2077336Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2077694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2077923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2078268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2078362Z kernel = self.compile( 2025-05-07T20:32:43.2078816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2079003Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2079131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2079141Z 2025-05-07T20:32:43.2079353Z self = 2025-05-07T20:32:43.2080122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2080625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb08faf0>} 2025-05-07T20:32:43.2081381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2081578Z context = 2025-05-07T20:32:43.2081582Z 2025-05-07T20:32:43.2081845Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2082117Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2082223Z module_map=module_map) 2025-05-07T20:32:43.2082388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2082485Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2082565Z E ^ 2025-05-07T20:32:43.2082919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2082924Z 2025-05-07T20:32:43.2083381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2083386Z 2025-05-07T20:32:43.2083498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2083718Z self=, 2025-05-07T20:32:43.2083807Z T=4096, 2025-05-07T20:32:43.2083885Z D=5120, 2025-05-07T20:32:43.2083968Z scale_ub=1200.0, 2025-05-07T20:32:43.2084058Z contiguous=True, 2025-05-07T20:32:43.2084144Z compiled=False, 2025-05-07T20:32:43.2084218Z ) 2025-05-07T20:32:43.2084437Z self = 2025-05-07T20:32:43.2084612Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2084616Z 2025-05-07T20:32:43.2084692Z @given( 2025-05-07T20:32:43.2084815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2084914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2085034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2085158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2085273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2085354Z ) 2025-05-07T20:32:43.2085602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2085699Z def test_silu_mul_quant( 2025-05-07T20:32:43.2085780Z self, 2025-05-07T20:32:43.2085856Z T: int, 2025-05-07T20:32:43.2085935Z D: int, 2025-05-07T20:32:43.2086036Z scale_ub: Optional[float], 2025-05-07T20:32:43.2086124Z contiguous: bool, 2025-05-07T20:32:43.2086210Z compiled: bool, 2025-05-07T20:32:43.2086294Z ) -> None: 2025-05-07T20:32:43.2086390Z torch.manual_seed(2025) 2025-05-07T20:32:43.2086465Z 2025-05-07T20:32:43.2086637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2086711Z 2025-05-07T20:32:43.2086857Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2086983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2087073Z x = x_sign * x_clamp 2025-05-07T20:32:43.2087157Z x0 = x[:, :D] 2025-05-07T20:32:43.2087237Z x1 = x[:, D:] 2025-05-07T20:32:43.2087313Z 2025-05-07T20:32:43.2087402Z if contiguous: 2025-05-07T20:32:43.2087493Z x0 = x0.contiguous() 2025-05-07T20:32:43.2087580Z x1 = x1.contiguous() 2025-05-07T20:32:43.2087654Z 2025-05-07T20:32:43.2087743Z if scale_ub is not None: 2025-05-07T20:32:43.2087848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2087985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2088060Z ) 2025-05-07T20:32:43.2088142Z else: 2025-05-07T20:32:43.2088234Z scale_ub_tensor = None 2025-05-07T20:32:43.2088307Z 2025-05-07T20:32:43.2088446Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2088543Z op = silu_mul_quant 2025-05-07T20:32:43.2088628Z if compiled: 2025-05-07T20:32:43.2088734Z op = torch.compile(op) 2025-05-07T20:32:43.2088840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2088915Z 2025-05-07T20:32:43.2089092Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2089097Z 2025-05-07T20:32:43.2089198Z moe/activation_test.py:117: 2025-05-07T20:32:43.2089330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2089430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2089528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2090027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2090123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2090481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2090760Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2091141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2091238Z kernel = self.compile( 2025-05-07T20:32:43.2091620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2091798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2091927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2091931Z 2025-05-07T20:32:43.2092138Z self = 2025-05-07T20:32:43.2092908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2093419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb165ca0>} 2025-05-07T20:32:43.2094156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2094355Z context = 2025-05-07T20:32:43.2094360Z 2025-05-07T20:32:43.2094527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2094798Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2094906Z module_map=module_map) 2025-05-07T20:32:43.2095111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2095220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2095298Z E ^ 2025-05-07T20:32:43.2095654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2095664Z 2025-05-07T20:32:43.2096084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2096089Z 2025-05-07T20:32:43.2096194Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2096421Z self=, 2025-05-07T20:32:43.2096498Z T=1, 2025-05-07T20:32:43.2096574Z D=5120, 2025-05-07T20:32:43.2096662Z scale_ub=None, 2025-05-07T20:32:43.2096747Z contiguous=True, 2025-05-07T20:32:43.2096829Z compiled=True, 2025-05-07T20:32:43.2096906Z ) 2025-05-07T20:32:43.2097125Z self = 2025-05-07T20:32:43.2097296Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2097301Z 2025-05-07T20:32:43.2097378Z @given( 2025-05-07T20:32:43.2097497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2097606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2097800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2097919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2098040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2098114Z ) 2025-05-07T20:32:43.2098362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2098460Z def test_silu_mul_quant( 2025-05-07T20:32:43.2098537Z self, 2025-05-07T20:32:43.2098620Z T: int, 2025-05-07T20:32:43.2098697Z D: int, 2025-05-07T20:32:43.2098796Z scale_ub: Optional[float], 2025-05-07T20:32:43.2098889Z contiguous: bool, 2025-05-07T20:32:43.2099020Z compiled: bool, 2025-05-07T20:32:43.2099100Z ) -> None: 2025-05-07T20:32:43.2099199Z torch.manual_seed(2025) 2025-05-07T20:32:43.2099271Z 2025-05-07T20:32:43.2099439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2099515Z 2025-05-07T20:32:43.2099616Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2099740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2099836Z x = x_sign * x_clamp 2025-05-07T20:32:43.2099916Z x0 = x[:, :D] 2025-05-07T20:32:43.2100000Z x1 = x[:, D:] 2025-05-07T20:32:43.2100076Z 2025-05-07T20:32:43.2100159Z if contiguous: 2025-05-07T20:32:43.2100255Z x0 = x0.contiguous() 2025-05-07T20:32:43.2100345Z x1 = x1.contiguous() 2025-05-07T20:32:43.2100418Z 2025-05-07T20:32:43.2100515Z if scale_ub is not None: 2025-05-07T20:32:43.2100619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2100762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2100845Z ) 2025-05-07T20:32:43.2100921Z else: 2025-05-07T20:32:43.2101014Z scale_ub_tensor = None 2025-05-07T20:32:43.2101147Z 2025-05-07T20:32:43.2101276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2101372Z op = silu_mul_quant 2025-05-07T20:32:43.2101463Z if compiled: 2025-05-07T20:32:43.2101563Z op = torch.compile(op) 2025-05-07T20:32:43.2101674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2101747Z 2025-05-07T20:32:43.2101838Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2101964Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2102038Z 2025-05-07T20:32:43.2102176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2102285Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2102427Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2102555Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2102698Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2102773Z 2025-05-07T20:32:43.2102878Z > y_fp8_ref, 
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fbfda9b6550>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[traceback identical to the one above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] -> Triton autotuner -> compile -> make_ir]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
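Every failure in this stretch has the same root cause: both Triton kernels involved (`_kernel_quantize_fp8_row` from the reference quantizer and `_fbgemm_silu_mul_quant` from the op under test) emit the `fp8e4nv` dtype, Triton's name for the E4M3 format behind `torch.float8_e4m3fn`, and Triton of this vintage lowers that dtype only on NVIDIA GPUs with compute capability 8.9 (Ada) or newer; older architectures expose only `fp8e4b15` and `fp8e5`, which is exactly what the ValueError reports. A minimal sketch of a capability gate follows, assuming the >= 8.9 requirement above; `skip_if_no_fp8e4nv` is a hypothetical helper, not something defined in `activation_test.py`:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    # Assumption: Triton compiles fp8e4nv (E4M3) kernels only on NVIDIA
    # GPUs with compute capability >= 8.9; anything older fails with the
    # exact ValueError captured in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical decorator: applied to test_silu_mul_quant, it would skip
# the test on older GPUs instead of erroring through every Hypothesis draw.
skip_if_no_fp8e4nv = unittest.skipIf(
    not fp8e4nv_supported(),
    "Triton fp8e4nv (torch.float8_e4m3fn) requires compute capability >= 8.9",
)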
2025-05-07T20:32:43.2134471Z op = silu_mul_quant 2025-05-07T20:32:43.2134555Z if compiled: 2025-05-07T20:32:43.2134653Z op = torch.compile(op) 2025-05-07T20:32:43.2134763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2134839Z 2025-05-07T20:32:43.2134937Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2135059Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2135138Z 2025-05-07T20:32:43.2135326Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2135432Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2135534Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2135665Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2140385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2140520Z 2025-05-07T20:32:43.2140675Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2140681Z 2025-05-07T20:32:43.2140787Z moe/activation_test.py:126: 2025-05-07T20:32:43.2140918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2141031Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2141235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2141814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2141923Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2142283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2142515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2143093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2143356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2143755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2144008Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2144390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2144624Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2144971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2145052Z fn() 2025-05-07T20:32:43.2145457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2145547Z self.fn.run( 2025-05-07T20:32:43.2145879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2145974Z kernel = self.compile( 2025-05-07T20:32:43.2146352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2146531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2146657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2146670Z 2025-05-07T20:32:43.2146882Z self = 2025-05-07T20:32:43.2147666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2148181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdac8fb80>} 2025-05-07T20:32:43.2148930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2149128Z context = 2025-05-07T20:32:43.2149133Z 2025-05-07T20:32:43.2149362Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2149636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2149749Z module_map=module_map) 2025-05-07T20:32:43.2149910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2150016Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2150100Z E ^ 2025-05-07T20:32:43.2150456Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2150461Z 2025-05-07T20:32:43.2150875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2150879Z 2025-05-07T20:32:43.2150982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2151204Z self=, 2025-05-07T20:32:43.2151287Z T=4096, 2025-05-07T20:32:43.2151370Z D=5120, 2025-05-07T20:32:43.2151460Z scale_ub=None, 2025-05-07T20:32:43.2151544Z contiguous=True, 2025-05-07T20:32:43.2151625Z compiled=True, 2025-05-07T20:32:43.2151704Z ) 2025-05-07T20:32:43.2151919Z self = 2025-05-07T20:32:43.2152173Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2152179Z 2025-05-07T20:32:43.2152261Z @given( 2025-05-07T20:32:43.2152380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2152479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2152605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2152720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2152839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2152913Z ) 2025-05-07T20:32:43.2153160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2153301Z def test_silu_mul_quant( 2025-05-07T20:32:43.2153378Z self, 2025-05-07T20:32:43.2153455Z T: int, 2025-05-07T20:32:43.2153539Z D: int, 2025-05-07T20:32:43.2153634Z scale_ub: Optional[float], 2025-05-07T20:32:43.2153724Z contiguous: bool, 2025-05-07T20:32:43.2153821Z compiled: bool, 2025-05-07T20:32:43.2153901Z ) -> None: 2025-05-07T20:32:43.2153994Z torch.manual_seed(2025) 2025-05-07T20:32:43.2154070Z 2025-05-07T20:32:43.2154240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2154315Z 2025-05-07T20:32:43.2154408Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2154533Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2154629Z x = x_sign * x_clamp 2025-05-07T20:32:43.2154710Z x0 = x[:, :D] 2025-05-07T20:32:43.2154788Z x1 = x[:, D:] 2025-05-07T20:32:43.2154861Z 2025-05-07T20:32:43.2154943Z if contiguous: 2025-05-07T20:32:43.2155038Z x0 = x0.contiguous() 2025-05-07T20:32:43.2155136Z x1 = x1.contiguous() 2025-05-07T20:32:43.2155207Z 2025-05-07T20:32:43.2155296Z if scale_ub is not None: 2025-05-07T20:32:43.2155403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2155542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2155618Z ) 2025-05-07T20:32:43.2155702Z else: 2025-05-07T20:32:43.2155796Z scale_ub_tensor 
= None 2025-05-07T20:32:43.2155876Z 2025-05-07T20:32:43.2156005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2156097Z op = silu_mul_quant 2025-05-07T20:32:43.2156189Z if compiled: 2025-05-07T20:32:43.2156288Z op = torch.compile(op) 2025-05-07T20:32:43.2156394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2156470Z 2025-05-07T20:32:43.2156559Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2156730Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2156808Z 2025-05-07T20:32:43.2156943Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2157045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2157150Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2157277Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2157419Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2157491Z 2025-05-07T20:32:43.2157591Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2157595Z 2025-05-07T20:32:43.2157698Z moe/activation_test.py:126: 2025-05-07T20:32:43.2157828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2157932Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2158072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2158632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2158738Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2159101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2159406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2159781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2160039Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2160437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2160709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2161160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2161329Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2161673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2161754Z fn() 2025-05-07T20:32:43.2162156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2162241Z self.fn.run( 2025-05-07T20:32:43.2162576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2162670Z kernel = self.compile( 2025-05-07T20:32:43.2163052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2163234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2163368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2163373Z 2025-05-07T20:32:43.2163583Z self = 2025-05-07T20:32:43.2164357Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2164879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda81cca0>} 2025-05-07T20:32:43.2165621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2165815Z context = 2025-05-07T20:32:43.2165860Z 2025-05-07T20:32:43.2166035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2166301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2166407Z module_map=module_map) 2025-05-07T20:32:43.2166581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2166684Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2166762Z E ^ 2025-05-07T20:32:43.2167129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2167133Z 2025-05-07T20:32:43.2167547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2167551Z 2025-05-07T20:32:43.2167658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2167883Z self=, 2025-05-07T20:32:43.2167963Z T=16384, 2025-05-07T20:32:43.2168044Z D=5120, 2025-05-07T20:32:43.2168126Z scale_ub=None, 2025-05-07T20:32:43.2168211Z contiguous=True, 2025-05-07T20:32:43.2168302Z compiled=True, 2025-05-07T20:32:43.2168374Z ) 2025-05-07T20:32:43.2168668Z self = 2025-05-07T20:32:43.2168844Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2168848Z 2025-05-07T20:32:43.2168924Z @given( 2025-05-07T20:32:43.2169050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2169151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2169267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2169390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2169502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2169583Z ) 2025-05-07T20:32:43.2169874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2169968Z def test_silu_mul_quant( 2025-05-07T20:32:43.2170048Z self, 2025-05-07T20:32:43.2170125Z T: int, 2025-05-07T20:32:43.2170203Z D: int, 2025-05-07T20:32:43.2170310Z scale_ub: Optional[float], 2025-05-07T20:32:43.2170399Z contiguous: bool, 2025-05-07T20:32:43.2170486Z compiled: bool, 2025-05-07T20:32:43.2170572Z ) -> None: 2025-05-07T20:32:43.2170664Z torch.manual_seed(2025) 2025-05-07T20:32:43.2170736Z 2025-05-07T20:32:43.2170909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2170982Z 2025-05-07T20:32:43.2171073Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2171203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2171292Z x = x_sign * x_clamp 2025-05-07T20:32:43.2171377Z x0 = x[:, :D] 2025-05-07T20:32:43.2171463Z x1 = x[:, D:] 2025-05-07T20:32:43.2171535Z 2025-05-07T20:32:43.2171625Z if contiguous: 2025-05-07T20:32:43.2171716Z x0 = x0.contiguous() 2025-05-07T20:32:43.2171803Z x1 = x1.contiguous() 2025-05-07T20:32:43.2171878Z 2025-05-07T20:32:43.2171971Z if scale_ub is not None: 2025-05-07T20:32:43.2172081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2172221Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:43.2172297Z ) 2025-05-07T20:32:43.2172375Z else: 2025-05-07T20:32:43.2172475Z scale_ub_tensor = None 2025-05-07T20:32:43.2172550Z 2025-05-07T20:32:43.2172685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2172778Z op = silu_mul_quant 2025-05-07T20:32:43.2172865Z if compiled: 2025-05-07T20:32:43.2172969Z op = torch.compile(op) 2025-05-07T20:32:43.2173078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2173197Z 2025-05-07T20:32:43.2173293Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2173415Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2173486Z 2025-05-07T20:32:43.2173625Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2173734Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2173834Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2173959Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2174098Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2174177Z 2025-05-07T20:32:43.2174276Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2174281Z 2025-05-07T20:32:43.2174379Z moe/activation_test.py:126: 2025-05-07T20:32:43.2174516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2174620Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2174760Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2175316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2175417Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2175888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2176117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2176485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2176742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2177140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2177393Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2177809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2177978Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2178320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2178398Z fn() 2025-05-07T20:32:43.2178799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2178881Z self.fn.run( 2025-05-07T20:32:43.2179225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2179319Z kernel = self.compile( 2025-05-07T20:32:43.2179700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2179890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2180017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:43.2180022Z 2025-05-07T20:32:43.2180235Z self = 2025-05-07T20:32:43.2181024Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2181582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdaddcb80>} 2025-05-07T20:32:43.2182374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2182570Z context = 2025-05-07T20:32:43.2182575Z 2025-05-07T20:32:43.2182745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2183009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2183127Z module_map=module_map) 2025-05-07T20:32:43.2183287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2183391Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2183472Z E ^ 2025-05-07T20:32:43.2183830Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2183834Z 2025-05-07T20:32:43.2184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2184254Z 2025-05-07T20:32:43.2184366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2184591Z self=, 2025-05-07T20:32:43.2184673Z T=1, 2025-05-07T20:32:43.2184749Z D=5120, 2025-05-07T20:32:43.2184832Z scale_ub=1200.0, 2025-05-07T20:32:43.2184918Z contiguous=True, 2025-05-07T20:32:43.2185078Z compiled=True, 2025-05-07T20:32:43.2185151Z ) 2025-05-07T20:32:43.2185372Z self = 2025-05-07T20:32:43.2185537Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2185542Z 2025-05-07T20:32:43.2185620Z @given( 2025-05-07T20:32:43.2185741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2185840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2185960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2186077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2186238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2186317Z ) 2025-05-07T20:32:43.2186566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2186659Z def test_silu_mul_quant( 2025-05-07T20:32:43.2186739Z self, 2025-05-07T20:32:43.2186818Z T: int, 2025-05-07T20:32:43.2186899Z D: int, 2025-05-07T20:32:43.2187002Z scale_ub: Optional[float], 2025-05-07T20:32:43.2187090Z contiguous: bool, 2025-05-07T20:32:43.2187174Z compiled: bool, 2025-05-07T20:32:43.2187259Z ) -> None: 2025-05-07T20:32:43.2187351Z torch.manual_seed(2025) 2025-05-07T20:32:43.2187428Z 2025-05-07T20:32:43.2187593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2187665Z 2025-05-07T20:32:43.2187760Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2187882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2187977Z x = x_sign * x_clamp 2025-05-07T20:32:43.2188062Z x0 = x[:, :D] 2025-05-07T20:32:43.2188143Z x1 = x[:, D:] 2025-05-07T20:32:43.2188215Z 2025-05-07T20:32:43.2188302Z if contiguous: 2025-05-07T20:32:43.2188393Z x0 = x0.contiguous() 2025-05-07T20:32:43.2188482Z x1 = x1.contiguous() 2025-05-07T20:32:43.2188566Z 2025-05-07T20:32:43.2188657Z if scale_ub is not None: 2025-05-07T20:32:43.2188763Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:43.2188901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2188976Z ) 2025-05-07T20:32:43.2189056Z else: 2025-05-07T20:32:43.2189148Z scale_ub_tensor = None 2025-05-07T20:32:43.2189220Z 2025-05-07T20:32:43.2189351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2189440Z op = silu_mul_quant 2025-05-07T20:32:43.2189526Z if compiled: 2025-05-07T20:32:43.2189674Z op = torch.compile(op) 2025-05-07T20:32:43.2189783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2189858Z 2025-05-07T20:32:43.2189954Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2189958Z 2025-05-07T20:32:43.2190055Z moe/activation_test.py:117: 2025-05-07T20:32:43.2190186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2190291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2190390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2190756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2190847Z return fn(*args, **kwargs) 2025-05-07T20:32:43.2191337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2191436Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2191801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2192034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2192370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2192541Z kernel = self.compile( 2025-05-07T20:32:43.2192929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2193105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2193232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2193240Z 2025-05-07T20:32:43.2193446Z self = 2025-05-07T20:32:43.2194216Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2194770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda7eda60>} 2025-05-07T20:32:43.2195526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2195724Z context = 2025-05-07T20:32:43.2195729Z 2025-05-07T20:32:43.2195894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2196159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2196270Z module_map=module_map) 2025-05-07T20:32:43.2196436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2196542Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2196617Z E ^ 2025-05-07T20:32:43.2196974Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2196981Z 2025-05-07T20:32:43.2197405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2197409Z 2025-05-07T20:32:43.2197509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2197729Z self=, 2025-05-07T20:32:43.2197812Z T=1, 2025-05-07T20:32:43.2197887Z D=5120, 2025-05-07T20:32:43.2197972Z scale_ub=None, 2025-05-07T20:32:43.2198056Z contiguous=False, 2025-05-07T20:32:43.2198137Z compiled=True, 2025-05-07T20:32:43.2198210Z ) 2025-05-07T20:32:43.2198464Z self = 2025-05-07T20:32:43.2198638Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2198642Z 2025-05-07T20:32:43.2198724Z @given( 2025-05-07T20:32:43.2198841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2198940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2199062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2199179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2199296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2199368Z ) 2025-05-07T20:32:43.2199616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2199714Z def test_silu_mul_quant( 2025-05-07T20:32:43.2199790Z self, 2025-05-07T20:32:43.2199866Z T: int, 2025-05-07T20:32:43.2199945Z D: int, 2025-05-07T20:32:43.2200042Z scale_ub: Optional[float], 2025-05-07T20:32:43.2200137Z contiguous: bool, 2025-05-07T20:32:43.2200224Z compiled: bool, 2025-05-07T20:32:43.2200302Z ) -> None: 2025-05-07T20:32:43.2200397Z torch.manual_seed(2025) 2025-05-07T20:32:43.2200474Z 2025-05-07T20:32:43.2200640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2200758Z 2025-05-07T20:32:43.2200885Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2201012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2201106Z x = x_sign * x_clamp 2025-05-07T20:32:43.2201186Z x0 = x[:, :D] 2025-05-07T20:32:43.2201264Z x1 = x[:, D:] 2025-05-07T20:32:43.2201339Z 2025-05-07T20:32:43.2201420Z if contiguous: 2025-05-07T20:32:43.2201510Z x0 = x0.contiguous() 2025-05-07T20:32:43.2201601Z x1 = x1.contiguous() 2025-05-07T20:32:43.2201673Z 2025-05-07T20:32:43.2201763Z if scale_ub is not None: 2025-05-07T20:32:43.2201870Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2202053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2202134Z ) 2025-05-07T20:32:43.2202211Z else: 2025-05-07T20:32:43.2202307Z scale_ub_tensor = None 2025-05-07T20:32:43.2202379Z 2025-05-07T20:32:43.2202512Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2202604Z op = silu_mul_quant 2025-05-07T20:32:43.2202692Z if compiled: 2025-05-07T20:32:43.2202796Z op = torch.compile(op) 2025-05-07T20:32:43.2202900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2202977Z 2025-05-07T20:32:43.2203068Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2203186Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2203260Z 2025-05-07T20:32:43.2203396Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2203497Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2203609Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2203731Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2203870Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2203947Z 2025-05-07T20:32:43.2204046Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2204056Z 2025-05-07T20:32:43.2204157Z moe/activation_test.py:126: 2025-05-07T20:32:43.2204284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2204388Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2204525Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2205074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2205172Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2205604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2205832Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2206205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2206468Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2206866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2207120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2207488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2207659Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2207998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2208076Z fn() 2025-05-07T20:32:43.2208472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2208553Z self.fn.run( 2025-05-07T20:32:43.2208925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2209058Z kernel = self.compile( 2025-05-07T20:32:43.2209430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2209609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2209734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2209739Z 2025-05-07T20:32:43.2209945Z self = 2025-05-07T20:32:43.2210718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2211279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfda1b9430>} 2025-05-07T20:32:43.2212033Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2212224Z context = 2025-05-07T20:32:43.2212229Z 2025-05-07T20:32:43.2212393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2212659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2212772Z module_map=module_map) 2025-05-07T20:32:43.2212937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2213041Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2213118Z E ^ 2025-05-07T20:32:43.2213472Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2213479Z 2025-05-07T20:32:43.2213891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2213895Z 2025-05-07T20:32:43.2214000Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2214224Z self=, 2025-05-07T20:32:43.2214298Z T=1, 2025-05-07T20:32:43.2214376Z D=5120, 2025-05-07T20:32:43.2214456Z scale_ub=None, 2025-05-07T20:32:43.2214541Z contiguous=True, 2025-05-07T20:32:43.2214627Z compiled=False, 2025-05-07T20:32:43.2214743Z ) 2025-05-07T20:32:43.2214960Z self = 2025-05-07T20:32:43.2215130Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2215134Z 2025-05-07T20:32:43.2215210Z @given( 2025-05-07T20:32:43.2215339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2215435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2215553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2215677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2215792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2215866Z ) 2025-05-07T20:32:43.2216118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2216212Z def test_silu_mul_quant( 2025-05-07T20:32:43.2216291Z self, 2025-05-07T20:32:43.2216371Z T: int, 2025-05-07T20:32:43.2216448Z D: int, 2025-05-07T20:32:43.2216547Z scale_ub: Optional[float], 2025-05-07T20:32:43.2216640Z contiguous: bool, 2025-05-07T20:32:43.2216726Z compiled: bool, 2025-05-07T20:32:43.2216806Z ) -> None: 2025-05-07T20:32:43.2216898Z torch.manual_seed(2025) 2025-05-07T20:32:43.2216969Z 2025-05-07T20:32:43.2217216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2217292Z 2025-05-07T20:32:43.2217383Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2217510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2217597Z x = x_sign * x_clamp 2025-05-07T20:32:43.2217676Z x0 = x[:, :D] 2025-05-07T20:32:43.2217758Z x1 = x[:, D:] 2025-05-07T20:32:43.2217831Z 2025-05-07T20:32:43.2217916Z if contiguous: 2025-05-07T20:32:43.2218010Z x0 = x0.contiguous() 2025-05-07T20:32:43.2218098Z x1 = x1.contiguous() 2025-05-07T20:32:43.2218172Z 2025-05-07T20:32:43.2218308Z if scale_ub is not None: 2025-05-07T20:32:43.2218413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2218552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2218628Z ) 2025-05-07T20:32:43.2218705Z else: 2025-05-07T20:32:43.2218800Z scale_ub_tensor = None 2025-05-07T20:32:43.2218880Z 2025-05-07T20:32:43.2219009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2219102Z op = silu_mul_quant 2025-05-07T20:32:43.2219189Z if compiled: 2025-05-07T20:32:43.2219287Z op 
= torch.compile(op) 2025-05-07T20:32:43.2219399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2219470Z 2025-05-07T20:32:43.2219567Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2219572Z 2025-05-07T20:32:43.2219669Z moe/activation_test.py:117: 2025-05-07T20:32:43.2219795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2219907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2220007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2220509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2220615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2220974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2221250Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2221585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2221678Z kernel = self.compile( 2025-05-07T20:32:43.2222064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2222281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2222409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2222417Z 2025-05-07T20:32:43.2222626Z self = 2025-05-07T20:32:43.2223395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2223915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda1b9e50>} 2025-05-07T20:32:43.2224663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2224862Z context = 2025-05-07T20:32:43.2224867Z 2025-05-07T20:32:43.2225034Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2225300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2225488Z module_map=module_map) 2025-05-07T20:32:43.2225651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2225749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2225829Z E ^ 2025-05-07T20:32:43.2226189Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2226194Z 2025-05-07T20:32:43.2226606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2226610Z 2025-05-07T20:32:43.2226713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2226980Z self=, 2025-05-07T20:32:43.2227059Z T=128, 2025-05-07T20:32:43.2227133Z D=5120, 2025-05-07T20:32:43.2227217Z scale_ub=None, 2025-05-07T20:32:43.2227304Z contiguous=False, 2025-05-07T20:32:43.2227386Z compiled=True, 2025-05-07T20:32:43.2227465Z ) 2025-05-07T20:32:43.2227682Z self = 2025-05-07T20:32:43.2227851Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2227855Z 2025-05-07T20:32:43.2227936Z @given( 2025-05-07T20:32:43.2228052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2228148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2228266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2228382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2228498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2228579Z ) 2025-05-07T20:32:43.2228823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2228919Z def test_silu_mul_quant( 2025-05-07T20:32:43.2228994Z self, 2025-05-07T20:32:43.2229072Z T: int, 2025-05-07T20:32:43.2229153Z D: int, 2025-05-07T20:32:43.2229253Z scale_ub: Optional[float], 2025-05-07T20:32:43.2229342Z contiguous: bool, 2025-05-07T20:32:43.2229432Z compiled: bool, 2025-05-07T20:32:43.2229509Z ) -> None: 2025-05-07T20:32:43.2229603Z torch.manual_seed(2025) 2025-05-07T20:32:43.2229678Z 2025-05-07T20:32:43.2229845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2229918Z 2025-05-07T20:32:43.2230012Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2230136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2230229Z x = x_sign * x_clamp 2025-05-07T20:32:43.2230354Z x0 = x[:, :D] 2025-05-07T20:32:43.2230436Z x1 = x[:, D:] 2025-05-07T20:32:43.2230512Z 2025-05-07T20:32:43.2230595Z if contiguous: 2025-05-07T20:32:43.2230687Z x0 = x0.contiguous() 2025-05-07T20:32:43.2230782Z x1 = x1.contiguous() 2025-05-07T20:32:43.2230853Z 2025-05-07T20:32:43.2230948Z if scale_ub is not None: 2025-05-07T20:32:43.2231058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2231192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2231268Z ) 2025-05-07T20:32:43.2231350Z else: 2025-05-07T20:32:43.2231442Z scale_ub_tensor = None 2025-05-07T20:32:43.2231516Z 2025-05-07T20:32:43.2231643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2231733Z op = silu_mul_quant 2025-05-07T20:32:43.2231823Z if compiled: 2025-05-07T20:32:43.2231922Z op = torch.compile(op) 2025-05-07T20:32:43.2232037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2232114Z 2025-05-07T20:32:43.2232205Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2232209Z 2025-05-07T20:32:43.2232305Z moe/activation_test.py:117: 2025-05-07T20:32:43.2232436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2232613Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2232717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2233083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2233174Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2233667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2233763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2234130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2234422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2234763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2234856Z kernel = self.compile( 2025-05-07T20:32:43.2235237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2235415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2235544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2235549Z 2025-05-07T20:32:43.2235757Z self = 2025-05-07T20:32:43.2236531Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2237045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98b8040>} 2025-05-07T20:32:43.2237794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2237995Z context = 2025-05-07T20:32:43.2237999Z 2025-05-07T20:32:43.2238162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2238424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2238531Z module_map=module_map) 2025-05-07T20:32:43.2238690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2238831Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2238908Z E ^ 2025-05-07T20:32:43.2239256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2239263Z 2025-05-07T20:32:43.2239680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2239687Z 2025-05-07T20:32:43.2239788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2240015Z self=, 2025-05-07T20:32:43.2240320Z T=128, 2025-05-07T20:32:43.2240434Z D=7168, 2025-05-07T20:32:43.2240543Z scale_ub=1200.0, 2025-05-07T20:32:43.2240631Z contiguous=False, 2025-05-07T20:32:43.2240714Z compiled=False, 2025-05-07T20:32:43.2240791Z ) 2025-05-07T20:32:43.2241007Z self = 2025-05-07T20:32:43.2241190Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2241195Z 2025-05-07T20:32:43.2241272Z @given( 2025-05-07T20:32:43.2241388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2241490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2241742Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2241858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2241976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2242055Z ) 2025-05-07T20:32:43.2242307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2242401Z def test_silu_mul_quant( 2025-05-07T20:32:43.2242476Z self, 2025-05-07T20:32:43.2242557Z T: int, 2025-05-07T20:32:43.2242633Z D: int, 2025-05-07T20:32:43.2242730Z scale_ub: Optional[float], 2025-05-07T20:32:43.2242822Z contiguous: bool, 2025-05-07T20:32:43.2242969Z compiled: bool, 2025-05-07T20:32:43.2243047Z ) -> None: 2025-05-07T20:32:43.2243146Z torch.manual_seed(2025) 2025-05-07T20:32:43.2243219Z 2025-05-07T20:32:43.2243387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2243465Z 2025-05-07T20:32:43.2243562Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2243687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2243781Z x = x_sign * x_clamp 2025-05-07T20:32:43.2243862Z x0 = x[:, :D] 2025-05-07T20:32:43.2243949Z x1 = x[:, D:] 2025-05-07T20:32:43.2244022Z 2025-05-07T20:32:43.2244105Z if contiguous: 2025-05-07T20:32:43.2244199Z x0 = x0.contiguous() 2025-05-07T20:32:43.2244290Z x1 = x1.contiguous() 2025-05-07T20:32:43.2244363Z 2025-05-07T20:32:43.2244457Z if scale_ub is not None: 2025-05-07T20:32:43.2244561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2244700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2244783Z ) 2025-05-07T20:32:43.2244859Z else: 2025-05-07T20:32:43.2244952Z scale_ub_tensor = None 2025-05-07T20:32:43.2245031Z 2025-05-07T20:32:43.2245159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2245257Z op = silu_mul_quant 2025-05-07T20:32:43.2245342Z if compiled: 2025-05-07T20:32:43.2245443Z op = torch.compile(op) 2025-05-07T20:32:43.2245550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2245624Z 2025-05-07T20:32:43.2245714Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2245719Z 2025-05-07T20:32:43.2245820Z moe/activation_test.py:117: 2025-05-07T20:32:43.2245947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2246049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2246154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2246724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2246826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2247183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2247415Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2247756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2247849Z kernel = self.compile( 2025-05-07T20:32:43.2248228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2248406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2248531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2248544Z 2025-05-07T20:32:43.2248758Z self = 2025-05-07T20:32:43.2249560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2250110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98b8c10>} 2025-05-07T20:32:43.2250879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2251093Z context = 2025-05-07T20:32:43.2251098Z 2025-05-07T20:32:43.2251270Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2251577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2251690Z module_map=module_map) 2025-05-07T20:32:43.2251852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2251958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2252037Z E ^ 2025-05-07T20:32:43.2252387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2252391Z 2025-05-07T20:32:43.2252799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2252805Z 2025-05-07T20:32:43.2252910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2253135Z self=, 2025-05-07T20:32:43.2253216Z T=128, 2025-05-07T20:32:43.2253296Z D=5120, 2025-05-07T20:32:43.2253377Z scale_ub=None, 2025-05-07T20:32:43.2253466Z contiguous=False, 2025-05-07T20:32:43.2253551Z compiled=False, 2025-05-07T20:32:43.2253623Z ) 2025-05-07T20:32:43.2253841Z self = 2025-05-07T20:32:43.2254022Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2254027Z 2025-05-07T20:32:43.2254107Z @given( 2025-05-07T20:32:43.2254226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2254325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2254443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2254561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2254676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2254757Z ) 2025-05-07T20:32:43.2255002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2255144Z def test_silu_mul_quant( 2025-05-07T20:32:43.2255228Z self, 2025-05-07T20:32:43.2255305Z T: int, 2025-05-07T20:32:43.2255382Z D: int, 2025-05-07T20:32:43.2255482Z scale_ub: Optional[float], 2025-05-07T20:32:43.2255571Z contiguous: bool, 2025-05-07T20:32:43.2255667Z compiled: bool, 2025-05-07T20:32:43.2255746Z ) -> None: 2025-05-07T20:32:43.2255842Z torch.manual_seed(2025) 2025-05-07T20:32:43.2255920Z 2025-05-07T20:32:43.2256088Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2256163Z 2025-05-07T20:32:43.2256259Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2256385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2256473Z x = x_sign * x_clamp 2025-05-07T20:32:43.2256558Z x0 = x[:, :D] 2025-05-07T20:32:43.2256637Z x1 = x[:, D:] 2025-05-07T20:32:43.2256709Z 2025-05-07T20:32:43.2256797Z if contiguous: 2025-05-07T20:32:43.2256892Z x0 = x0.contiguous() 2025-05-07T20:32:43.2256983Z x1 = x1.contiguous() 2025-05-07T20:32:43.2257057Z 2025-05-07T20:32:43.2257148Z if scale_ub is not None: 2025-05-07T20:32:43.2257254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2257469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2257545Z ) 2025-05-07T20:32:43.2257624Z else: 2025-05-07T20:32:43.2257718Z scale_ub_tensor = None 2025-05-07T20:32:43.2257790Z 2025-05-07T20:32:43.2257922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2258012Z op = silu_mul_quant 2025-05-07T20:32:43.2258097Z if compiled: 2025-05-07T20:32:43.2258203Z op = torch.compile(op) 2025-05-07T20:32:43.2258307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2258379Z 2025-05-07T20:32:43.2258474Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2258521Z 2025-05-07T20:32:43.2258619Z moe/activation_test.py:117: 2025-05-07T20:32:43.2258747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2258849Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2258948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2259451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2259548Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2259909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2260133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2260469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2260568Z kernel = self.compile( 2025-05-07T20:32:43.2260958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2261229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2261358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2261365Z 2025-05-07T20:32:43.2261575Z self = 2025-05-07T20:32:43.2262342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2262857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9bf0310>} 2025-05-07T20:32:43.2263642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2263842Z context = 2025-05-07T20:32:43.2263847Z 2025-05-07T20:32:43.2264015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2264289Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2264396Z module_map=module_map) 2025-05-07T20:32:43.2264558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2264666Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2264743Z E ^ 2025-05-07T20:32:43.2265095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and traceback omitted; fails in fn() at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant ...]
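Every failure in this run reduces to the same root cause: the kernel asks Triton to emit fp8e4nv (the dtype backing torch.float8_e4m3fn), and Triton only supports that conversion on newer NVIDIA architectures. On the GPU driving this runner it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal skip-guard sketch follows; the helper name and the (8, 9) capability cutoff are assumptions (the exact threshold depends on the Triton version in use), not FBGEMM API:

    # Sketch: skip fp8e4nv-dependent tests on GPUs that predate Triton's
    # fp8e4nv support. `requires_fp8e4nv` and the (8, 9) cutoff are
    # illustrative assumptions, not part of FBGEMM.
    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # fp8e4nv backs torch.float8_e4m3fn; Triton rejects it on older
        # parts and offers only fp8e4b15 / fp8e5 there.
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton fp8e4nv (torch.float8_e4m3fn) is unsupported on this GPU",
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, this would turn the dozens of failures below into clean skips on this runner class.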
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... identical output omitted; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80, then fails with the same CompilationError ...]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical output omitted; same CompilationError in _fbgemm_silu_mul_quant ...]
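The full test body is reprinted for every drawn example because the suite runs with @settings(verbosity=Verbosity.verbose, ...). If log volume matters, normal verbosity still reports the falsifying examples without echoing the test source per draw. A sketch of the same decorator shape, with a literal standing in for the suite's _MAX_SAMPLES:

    # Sketch: same shape as the suite's decorators, with normal verbosity
    # so each drawn example is not accompanied by a dump of the test body.
    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_example(T: int) -> None:
        assert T > 0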
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body omitted; this example gets past the kernel under test, y_fp8, y_scale = fn() succeeds, and fails in the reference path instead ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[... autotuner benchmarking chain (triton/runtime/autotuner.py:186 -> 166 -> triton/testing.py:117 -> autotuner.py:152 -> jit.py:623 -> compiler.py:273) and make_ir frames omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical output omitted; same CompilationError in _fbgemm_silu_mul_quant ...]
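The reference path above spells out the op's semantics: y = x0 * sigmoid(x0) * x1 (a SiLU-gated product), then rowwise fp8 quantization via triton_quantize_fp8_row. A rough pure-PyTorch equivalent of that rowwise step is sketched below. It is inferred from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from FBGEMM's kernel, so the epsilon and the exact scale_ub semantics are assumptions:

    # Sketch: rowwise fp8 quantization as the test implies it. Scale is
    # row max-abs (optionally clamped to scale_ub) divided by the fp8 max;
    # dequantization is y_fp8.float() * scale[:, None], matching the test.
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # avoid div-by-zero rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale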
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2338220Z 2025-05-07T20:32:43.2338633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2338639Z 2025-05-07T20:32:43.2338739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2338964Z self=, 2025-05-07T20:32:43.2339040Z T=1, 2025-05-07T20:32:43.2339121Z D=5120, 2025-05-07T20:32:43.2339206Z scale_ub=1200.0, 2025-05-07T20:32:43.2339291Z contiguous=False, 2025-05-07T20:32:43.2339378Z compiled=False, 2025-05-07T20:32:43.2339451Z ) 2025-05-07T20:32:43.2339665Z self = 2025-05-07T20:32:43.2339838Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2339842Z 2025-05-07T20:32:43.2339918Z @given( 2025-05-07T20:32:43.2340036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2340369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2340614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2340741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2340858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2340933Z ) 2025-05-07T20:32:43.2341229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2341329Z def test_silu_mul_quant( 2025-05-07T20:32:43.2341404Z self, 2025-05-07T20:32:43.2341483Z T: int, 2025-05-07T20:32:43.2341558Z D: int, 2025-05-07T20:32:43.2341659Z scale_ub: Optional[float], 2025-05-07T20:32:43.2341749Z contiguous: bool, 2025-05-07T20:32:43.2341833Z compiled: bool, 2025-05-07T20:32:43.2341910Z ) -> None: 2025-05-07T20:32:43.2342006Z torch.manual_seed(2025) 2025-05-07T20:32:43.2342077Z 2025-05-07T20:32:43.2342245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2342321Z 2025-05-07T20:32:43.2342414Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2342542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2342630Z x = x_sign * x_clamp 2025-05-07T20:32:43.2342709Z x0 = x[:, :D] 2025-05-07T20:32:43.2342793Z x1 = x[:, D:] 2025-05-07T20:32:43.2342864Z 2025-05-07T20:32:43.2343011Z if contiguous: 2025-05-07T20:32:43.2343153Z x0 = x0.contiguous() 2025-05-07T20:32:43.2343244Z x1 = x1.contiguous() 2025-05-07T20:32:43.2343316Z 2025-05-07T20:32:43.2343410Z if scale_ub is not None: 2025-05-07T20:32:43.2343515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2343652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2343730Z ) 2025-05-07T20:32:43.2343806Z else: 2025-05-07T20:32:43.2343902Z scale_ub_tensor = None 2025-05-07T20:32:43.2343974Z 2025-05-07T20:32:43.2344103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2344266Z op = silu_mul_quant 2025-05-07T20:32:43.2344352Z if compiled: 2025-05-07T20:32:43.2344451Z op = torch.compile(op) 2025-05-07T20:32:43.2344559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2344632Z 2025-05-07T20:32:43.2344721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2344730Z 2025-05-07T20:32:43.2344830Z moe/activation_test.py:117: 2025-05-07T20:32:43.2344957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2345057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2345162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2345656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2345754Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2346109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2346340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2346682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2346775Z kernel = self.compile( 2025-05-07T20:32:43.2347167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2347344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2347469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2347473Z 2025-05-07T20:32:43.2347684Z self = 2025-05-07T20:32:43.2348509Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2349028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9a30550>} 2025-05-07T20:32:43.2349787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2349984Z context = 2025-05-07T20:32:43.2349989Z 2025-05-07T20:32:43.2350158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2350420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2350532Z module_map=module_map) 2025-05-07T20:32:43.2350693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2350796Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2350879Z E ^ 2025-05-07T20:32:43.2351238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2351243Z 2025-05-07T20:32:43.2351703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2351742Z 2025-05-07T20:32:43.2351845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2352066Z self=, 2025-05-07T20:32:43.2352145Z T=16384, 2025-05-07T20:32:43.2352222Z D=5120, 2025-05-07T20:32:43.2352305Z scale_ub=1200.0, 2025-05-07T20:32:43.2352396Z contiguous=False, 2025-05-07T20:32:43.2352482Z compiled=True, 2025-05-07T20:32:43.2352554Z ) 2025-05-07T20:32:43.2352775Z self = 2025-05-07T20:32:43.2352999Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2353004Z 2025-05-07T20:32:43.2353083Z @given( 2025-05-07T20:32:43.2353200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2353298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2353423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2353538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2353652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2353730Z ) 2025-05-07T20:32:43.2353979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2354072Z def test_silu_mul_quant( 2025-05-07T20:32:43.2354150Z self, 2025-05-07T20:32:43.2354226Z T: int, 2025-05-07T20:32:43.2354305Z D: int, 2025-05-07T20:32:43.2354403Z scale_ub: Optional[float], 2025-05-07T20:32:43.2354491Z contiguous: bool, 2025-05-07T20:32:43.2354585Z compiled: bool, 2025-05-07T20:32:43.2354664Z ) -> None: 2025-05-07T20:32:43.2354759Z torch.manual_seed(2025) 2025-05-07T20:32:43.2354836Z 2025-05-07T20:32:43.2355007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2355080Z 2025-05-07T20:32:43.2355182Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2355306Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2355396Z x = x_sign * x_clamp 2025-05-07T20:32:43.2355480Z x0 = x[:, :D] 2025-05-07T20:32:43.2355560Z x1 = x[:, D:] 2025-05-07T20:32:43.2355633Z 2025-05-07T20:32:43.2355718Z if contiguous: 2025-05-07T20:32:43.2355808Z x0 = x0.contiguous() 2025-05-07T20:32:43.2355901Z x1 = x1.contiguous() 2025-05-07T20:32:43.2355973Z 2025-05-07T20:32:43.2356063Z if scale_ub is not None: 2025-05-07T20:32:43.2356172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2356357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2356435Z ) 2025-05-07T20:32:43.2356514Z else: 2025-05-07T20:32:43.2356607Z scale_ub_tensor = None 2025-05-07T20:32:43.2356680Z 2025-05-07T20:32:43.2356815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2356911Z op = silu_mul_quant 2025-05-07T20:32:43.2356998Z if compiled: 2025-05-07T20:32:43.2357103Z op = torch.compile(op) 2025-05-07T20:32:43.2357208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2357282Z 2025-05-07T20:32:43.2357373Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2357377Z 2025-05-07T20:32:43.2357473Z moe/activation_test.py:117: 2025-05-07T20:32:43.2357601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2357701Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2357800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2358173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2358266Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2358759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2358939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2359302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2359527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2359864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2359958Z kernel = self.compile( 2025-05-07T20:32:43.2360343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2360583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2360713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2360718Z 2025-05-07T20:32:43.2360923Z self = 2025-05-07T20:32:43.2361710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2362230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae41f0>} 2025-05-07T20:32:43.2362968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2363164Z context = 2025-05-07T20:32:43.2363169Z 2025-05-07T20:32:43.2363334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2363601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2363711Z module_map=module_map) 2025-05-07T20:32:43.2363873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2363975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2364052Z E ^ 2025-05-07T20:32:43.2364407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2364412Z 2025-05-07T20:32:43.2364824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2364828Z 2025-05-07T20:32:43.2364976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2365202Z self=, 2025-05-07T20:32:43.2365279Z T=2048, 2025-05-07T20:32:43.2365354Z D=7168, 2025-05-07T20:32:43.2365441Z scale_ub=1200.0, 2025-05-07T20:32:43.2365526Z contiguous=False, 2025-05-07T20:32:43.2365613Z compiled=True, 2025-05-07T20:32:43.2365689Z ) 2025-05-07T20:32:43.2365909Z self = 2025-05-07T20:32:43.2366081Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2366085Z 2025-05-07T20:32:43.2366164Z @given( 2025-05-07T20:32:43.2366281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2366382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2366496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2366613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2366734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2366808Z ) 2025-05-07T20:32:43.2367057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2367155Z def test_silu_mul_quant( 2025-05-07T20:32:43.2367232Z self, 2025-05-07T20:32:43.2367350Z T: int, 2025-05-07T20:32:43.2367465Z D: int, 2025-05-07T20:32:43.2367564Z scale_ub: Optional[float], 2025-05-07T20:32:43.2367653Z contiguous: bool, 2025-05-07T20:32:43.2367743Z compiled: bool, 2025-05-07T20:32:43.2367820Z ) -> None: 2025-05-07T20:32:43.2367919Z torch.manual_seed(2025) 2025-05-07T20:32:43.2367991Z 2025-05-07T20:32:43.2368160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2368238Z 2025-05-07T20:32:43.2368329Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2368455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2368592Z x = x_sign * x_clamp 2025-05-07T20:32:43.2368672Z x0 = x[:, :D] 2025-05-07T20:32:43.2368752Z x1 = x[:, D:] 2025-05-07T20:32:43.2368826Z 2025-05-07T20:32:43.2368910Z if contiguous: 2025-05-07T20:32:43.2369001Z x0 = x0.contiguous() 2025-05-07T20:32:43.2369092Z x1 = x1.contiguous() 2025-05-07T20:32:43.2369169Z 2025-05-07T20:32:43.2369266Z if scale_ub is not None: 2025-05-07T20:32:43.2369373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2369509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2369590Z ) 2025-05-07T20:32:43.2369667Z else: 2025-05-07T20:32:43.2369760Z scale_ub_tensor = None 2025-05-07T20:32:43.2369834Z 2025-05-07T20:32:43.2369964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2370056Z op = silu_mul_quant 2025-05-07T20:32:43.2370144Z if compiled: 2025-05-07T20:32:43.2370248Z op = torch.compile(op) 2025-05-07T20:32:43.2370356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2370432Z 2025-05-07T20:32:43.2370523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2370527Z 2025-05-07T20:32:43.2370627Z moe/activation_test.py:117: 2025-05-07T20:32:43.2370756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2370862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2370965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2371335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2371426Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2371920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2372017Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2372417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2372649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2372989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2373092Z kernel = self.compile( 2025-05-07T20:32:43.2373474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2373651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2373778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2373783Z 2025-05-07T20:32:43.2373991Z self = 2025-05-07T20:32:43.2374764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2375280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae4ee0>} 2025-05-07T20:32:43.2376110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2376306Z context = 2025-05-07T20:32:43.2376310Z 2025-05-07T20:32:43.2376477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2376741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2376847Z module_map=module_map) 2025-05-07T20:32:43.2377053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2377154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2377231Z E ^ 2025-05-07T20:32:43.2377588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2377596Z 2025-05-07T20:32:43.2378013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2378017Z 2025-05-07T20:32:43.2378119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2378347Z self=, 2025-05-07T20:32:43.2378422Z T=1, 2025-05-07T20:32:43.2378500Z D=5120, 2025-05-07T20:32:43.2378581Z scale_ub=None, 2025-05-07T20:32:43.2378668Z contiguous=False, 2025-05-07T20:32:43.2378754Z compiled=False, 2025-05-07T20:32:43.2378828Z ) 2025-05-07T20:32:43.2379045Z self = 2025-05-07T20:32:43.2379219Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2379224Z 2025-05-07T20:32:43.2379301Z @given( 2025-05-07T20:32:43.2379420Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2379523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2379642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2379761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2379875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2379951Z ) 2025-05-07T20:32:43.2380198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2380293Z def test_silu_mul_quant( 2025-05-07T20:32:43.2380368Z self, 2025-05-07T20:32:43.2380448Z T: int, 2025-05-07T20:32:43.2380525Z D: int, 2025-05-07T20:32:43.2380625Z scale_ub: Optional[float], 2025-05-07T20:32:43.2380763Z contiguous: bool, 2025-05-07T20:32:43.2380872Z compiled: bool, 2025-05-07T20:32:43.2380954Z ) -> None: 2025-05-07T20:32:43.2381123Z torch.manual_seed(2025) 2025-05-07T20:32:43.2381197Z 2025-05-07T20:32:43.2381370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2381448Z 2025-05-07T20:32:43.2381541Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2381668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2381757Z x = x_sign * x_clamp 2025-05-07T20:32:43.2381836Z x0 = x[:, :D] 2025-05-07T20:32:43.2381919Z x1 = x[:, D:] 2025-05-07T20:32:43.2381991Z 2025-05-07T20:32:43.2382073Z if contiguous: 2025-05-07T20:32:43.2382167Z x0 = x0.contiguous() 2025-05-07T20:32:43.2382257Z x1 = x1.contiguous() 2025-05-07T20:32:43.2382328Z 2025-05-07T20:32:43.2382421Z if scale_ub is not None: 2025-05-07T20:32:43.2382531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2382672Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2382747Z ) 2025-05-07T20:32:43.2382825Z else: 2025-05-07T20:32:43.2382922Z scale_ub_tensor = None 2025-05-07T20:32:43.2382995Z 2025-05-07T20:32:43.2383168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2383297Z op = silu_mul_quant 2025-05-07T20:32:43.2383382Z if compiled: 2025-05-07T20:32:43.2383481Z op = torch.compile(op) 2025-05-07T20:32:43.2383587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2383659Z 2025-05-07T20:32:43.2383749Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2383753Z 2025-05-07T20:32:43.2383854Z moe/activation_test.py:117: 2025-05-07T20:32:43.2383979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2384081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2384223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2384722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2384822Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2385181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2385411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2385749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2385842Z kernel = self.compile( 2025-05-07T20:32:43.2386222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2386396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2386523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2386530Z 2025-05-07T20:32:43.2386737Z self = 2025-05-07T20:32:43.2387502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2388020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98475e0>} 2025-05-07T20:32:43.2388757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2388952Z context = 2025-05-07T20:32:43.2388960Z 2025-05-07T20:32:43.2389169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2389439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2389548Z module_map=module_map) 2025-05-07T20:32:43.2389715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2389813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2389892Z E ^ 2025-05-07T20:32:43.2390242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
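The root cause here is the GPU architecture, not the test parameters: fp8e4nv is Triton's name for the FP8 E4M3 format, which Triton only emits for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an sm_86-class device such as the A10G only fp8e4b15 and fp8e5 are available, so the kernel dies during ast_to_ttir before any GPU code runs. A minimal sketch of a capability gate that could skip FP8 E4M3 paths on such hardware follows; the helper and class names are illustrative assumptions, not part of activation_test.py:

    # Illustrative sketch only; names below are hypothetical, not FBGEMM code.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True when Triton can emit fp8e4nv: NVIDIA sm_89+ (Ada/Hopper)."""
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        # An A10G reports (8, 6) and fails this check, which matches the
        # CompilationError seen throughout this job.
        return (major, minor) >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class Fp8ActivationTests(unittest.TestCase):
        def test_silu_mul_quant_smoke(self) -> None:
            ...  # an FP8 E4M3 test body would go here

With a gate like this the job would report a skip on sm_86 runners instead of one CompilationError per Hypothesis example.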
Hypothesis then tried ten further examples; each one ran the identical test body and failed with the identical traceback into triton's compiler, ending in the same CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). For compiled=True the call additionally passed through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching the kernel launch; the outcome was unchanged. The examples tried:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
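That the ten shape and flag combinations above changed nothing is expected: the error is raised while lowering the kernel's AST, before any tensor shapes matter. Under the assumption stated earlier (sm_86 hardware), any Triton kernel that materializes an fp8e4nv value should reproduce it; a hypothetical minimal example, not taken from FBGEMM, would look like this:

    # Hypothetical minimal reproduction; not FBGEMM code.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On pre-sm_89 GPUs the cast below is what raises
        # ValueError("type fp8e4nv not supported in this architecture ...").
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)

On sm_89 or newer the same kernel compiles and stores valid E4M3 bytes. The last example Hypothesis tried in this block follows.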
[Hypothesis then retried the failing test with eleven further parameter combinations; each produced the identical test listing, traceback, and CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2691126Z 2025-05-07T20:32:43.2691538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2691543Z 2025-05-07T20:32:43.2691645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2691877Z self=, 2025-05-07T20:32:43.2691955Z T=128, 2025-05-07T20:32:43.2692031Z D=7168, 2025-05-07T20:32:43.2692119Z scale_ub=1200.0, 2025-05-07T20:32:43.2692205Z contiguous=False, 2025-05-07T20:32:43.2692287Z compiled=True, 2025-05-07T20:32:43.2692365Z ) 2025-05-07T20:32:43.2692588Z self = 2025-05-07T20:32:43.2692764Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2692768Z 2025-05-07T20:32:43.2692846Z @given( 2025-05-07T20:32:43.2692969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2693073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2693186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2693304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2693423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2693498Z ) 2025-05-07T20:32:43.2693796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2693893Z def test_silu_mul_quant( 2025-05-07T20:32:43.2693971Z self, 2025-05-07T20:32:43.2694055Z T: int, 2025-05-07T20:32:43.2694130Z D: int, 2025-05-07T20:32:43.2694232Z scale_ub: Optional[float], 2025-05-07T20:32:43.2694328Z contiguous: bool, 2025-05-07T20:32:43.2694413Z compiled: bool, 2025-05-07T20:32:43.2694491Z ) -> None: 2025-05-07T20:32:43.2694587Z torch.manual_seed(2025) 2025-05-07T20:32:43.2694660Z 2025-05-07T20:32:43.2694828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2694905Z 2025-05-07T20:32:43.2694996Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2695122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2695214Z x = x_sign * x_clamp 2025-05-07T20:32:43.2695295Z x0 = x[:, :D] 2025-05-07T20:32:43.2695388Z x1 = x[:, D:] 2025-05-07T20:32:43.2695463Z 2025-05-07T20:32:43.2695546Z if contiguous: 2025-05-07T20:32:43.2695641Z x0 = x0.contiguous() 2025-05-07T20:32:43.2695730Z x1 = x1.contiguous() 2025-05-07T20:32:43.2695803Z 2025-05-07T20:32:43.2695898Z if scale_ub is not None: 2025-05-07T20:32:43.2696088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2696228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2696307Z ) 2025-05-07T20:32:43.2696385Z else: 2025-05-07T20:32:43.2696478Z scale_ub_tensor = None 2025-05-07T20:32:43.2696558Z 2025-05-07T20:32:43.2696692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2696781Z op = silu_mul_quant 2025-05-07T20:32:43.2696872Z if compiled: 2025-05-07T20:32:43.2696971Z op = torch.compile(op) 2025-05-07T20:32:43.2697080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2697203Z 2025-05-07T20:32:43.2697294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2697298Z 2025-05-07T20:32:43.2697403Z moe/activation_test.py:117: 2025-05-07T20:32:43.2697531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2697633Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2697739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2698104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2698201Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2698694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2698792Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2699157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2699392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2699728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2699824Z kernel = self.compile( 2025-05-07T20:32:43.2700216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2700395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2700528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2700532Z 2025-05-07T20:32:43.2700740Z self = 2025-05-07T20:32:43.2701576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2702142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8e9a940>} 2025-05-07T20:32:43.2702888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2703092Z context = 2025-05-07T20:32:43.2703097Z 2025-05-07T20:32:43.2703266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2703538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2703644Z module_map=module_map) 2025-05-07T20:32:43.2703805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2703909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2703991Z E ^ 2025-05-07T20:32:43.2704355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2704360Z 2025-05-07T20:32:43.2704810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2704849Z 2025-05-07T20:32:43.2704954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2705178Z self=, 2025-05-07T20:32:43.2705254Z T=2048, 2025-05-07T20:32:43.2705331Z D=7168, 2025-05-07T20:32:43.2705418Z scale_ub=None, 2025-05-07T20:32:43.2705504Z contiguous=True, 2025-05-07T20:32:43.2705588Z compiled=True, 2025-05-07T20:32:43.2705664Z ) 2025-05-07T20:32:43.2705883Z self = 2025-05-07T20:32:43.2706062Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2706106Z 2025-05-07T20:32:43.2706183Z @given( 2025-05-07T20:32:43.2706301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2706406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2706521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2706644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2706768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2706841Z ) 2025-05-07T20:32:43.2707094Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2707187Z def test_silu_mul_quant( 2025-05-07T20:32:43.2707262Z self, 2025-05-07T20:32:43.2707344Z T: int, 2025-05-07T20:32:43.2707421Z D: int, 2025-05-07T20:32:43.2707519Z scale_ub: Optional[float], 2025-05-07T20:32:43.2707610Z contiguous: bool, 2025-05-07T20:32:43.2707695Z compiled: bool, 2025-05-07T20:32:43.2707778Z ) -> None: 2025-05-07T20:32:43.2707876Z torch.manual_seed(2025) 2025-05-07T20:32:43.2707952Z 2025-05-07T20:32:43.2708121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2708200Z 2025-05-07T20:32:43.2708292Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2708430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2708520Z x = x_sign * x_clamp 2025-05-07T20:32:43.2708601Z x0 = x[:, :D] 2025-05-07T20:32:43.2708685Z x1 = x[:, D:] 2025-05-07T20:32:43.2708761Z 2025-05-07T20:32:43.2708844Z if contiguous: 2025-05-07T20:32:43.2708940Z x0 = x0.contiguous() 2025-05-07T20:32:43.2709028Z x1 = x1.contiguous() 2025-05-07T20:32:43.2709101Z 2025-05-07T20:32:43.2709196Z if scale_ub is not None: 2025-05-07T20:32:43.2709301Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2709435Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2709586Z ) 2025-05-07T20:32:43.2709665Z else: 2025-05-07T20:32:43.2709762Z scale_ub_tensor = None 2025-05-07T20:32:43.2709836Z 2025-05-07T20:32:43.2709965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2710060Z op = silu_mul_quant 2025-05-07T20:32:43.2710152Z if compiled: 2025-05-07T20:32:43.2710252Z op = torch.compile(op) 2025-05-07T20:32:43.2710363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2710436Z 2025-05-07T20:32:43.2710528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2710532Z 2025-05-07T20:32:43.2710635Z moe/activation_test.py:117: 2025-05-07T20:32:43.2710764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2710865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2710971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2711338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2711438Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2711927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2712100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2712470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2712697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2713035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2713128Z kernel = self.compile( 2025-05-07T20:32:43.2713504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2713683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2713845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2713849Z 2025-05-07T20:32:43.2714058Z self = 2025-05-07T20:32:43.2714828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2715336Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8c4f550>} 2025-05-07T20:32:43.2716074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2716270Z context = 2025-05-07T20:32:43.2716275Z 2025-05-07T20:32:43.2716445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2716707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2716821Z module_map=module_map) 2025-05-07T20:32:43.2716990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2717089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2717167Z E ^ 2025-05-07T20:32:43.2717520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2717525Z 2025-05-07T20:32:43.2717933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2717937Z 2025-05-07T20:32:43.2718042Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2718304Z self=, 2025-05-07T20:32:43.2718383Z T=16384, 2025-05-07T20:32:43.2718463Z D=5120, 2025-05-07T20:32:43.2718544Z scale_ub=None, 2025-05-07T20:32:43.2718631Z contiguous=False, 2025-05-07T20:32:43.2718721Z compiled=False, 2025-05-07T20:32:43.2718802Z ) 2025-05-07T20:32:43.2719020Z self = 2025-05-07T20:32:43.2719198Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2719202Z 2025-05-07T20:32:43.2719280Z @given( 2025-05-07T20:32:43.2719400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2719502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2719615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2719737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2719849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2719929Z ) 2025-05-07T20:32:43.2720176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2720269Z def test_silu_mul_quant( 2025-05-07T20:32:43.2720349Z self, 2025-05-07T20:32:43.2720425Z T: int, 2025-05-07T20:32:43.2720546Z D: int, 2025-05-07T20:32:43.2720684Z scale_ub: Optional[float], 2025-05-07T20:32:43.2720798Z contiguous: bool, 2025-05-07T20:32:43.2720890Z compiled: bool, 2025-05-07T20:32:43.2720993Z ) -> None: 2025-05-07T20:32:43.2721087Z torch.manual_seed(2025) 2025-05-07T20:32:43.2721159Z 2025-05-07T20:32:43.2721329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2721408Z 2025-05-07T20:32:43.2721499Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2721628Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2723417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2723467Z 2025-05-07T20:32:43.2723591Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2723596Z 2025-05-07T20:32:43.2723701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2723927Z self=, 2025-05-07T20:32:43.2724005Z T=4096, 2025-05-07T20:32:43.2724081Z D=7168, 2025-05-07T20:32:43.2724166Z scale_ub=1200.0, 2025-05-07T20:32:43.2724250Z contiguous=True, 2025-05-07T20:32:43.2724338Z compiled=True, 2025-05-07T20:32:43.2724416Z ) 2025-05-07T20:32:43.2724632Z self = 2025-05-07T20:32:43.2724804Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2724814Z 2025-05-07T20:32:43.2724894Z @given( 2025-05-07T20:32:43.2725016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2725120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2725233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2725348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2725464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2725537Z ) 2025-05-07T20:32:43.2725787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2725884Z def test_silu_mul_quant( 2025-05-07T20:32:43.2725959Z self, 2025-05-07T20:32:43.2726081Z T: int, 2025-05-07T20:32:43.2726163Z D: int, 2025-05-07T20:32:43.2726264Z scale_ub: Optional[float], 2025-05-07T20:32:43.2726356Z contiguous: bool, 2025-05-07T20:32:43.2726441Z compiled: bool, 2025-05-07T20:32:43.2726518Z ) -> None: 2025-05-07T20:32:43.2726617Z torch.manual_seed(2025) 2025-05-07T20:32:43.2726692Z 2025-05-07T20:32:43.2726861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2726939Z 2025-05-07T20:32:43.2727031Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2727156Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2728915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2728923Z 2025-05-07T20:32:43.2729041Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2729122Z 2025-05-07T20:32:43.2729229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2729453Z self=, 2025-05-07T20:32:43.2729535Z T=16384, 2025-05-07T20:32:43.2729611Z D=7168, 2025-05-07T20:32:43.2729691Z scale_ub=None, 2025-05-07T20:32:43.2729780Z contiguous=False, 2025-05-07T20:32:43.2729864Z compiled=False, 2025-05-07T20:32:43.2729936Z ) 2025-05-07T20:32:43.2730155Z self = 2025-05-07T20:32:43.2730330Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2730378Z 2025-05-07T20:32:43.2730456Z @given( 2025-05-07T20:32:43.2730577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2730677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2730793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2730915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2731029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2731108Z ) 2025-05-07T20:32:43.2731359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2731452Z def test_silu_mul_quant( 2025-05-07T20:32:43.2731536Z self, 2025-05-07T20:32:43.2731613Z T: int, 2025-05-07T20:32:43.2731689Z D: int, 2025-05-07T20:32:43.2731789Z scale_ub: Optional[float], 2025-05-07T20:32:43.2731877Z contiguous: bool, 2025-05-07T20:32:43.2731962Z compiled: bool, 2025-05-07T20:32:43.2732042Z ) -> None: 2025-05-07T20:32:43.2732142Z torch.manual_seed(2025) 2025-05-07T20:32:43.2732219Z 2025-05-07T20:32:43.2732385Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2734149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2734163Z 2025-05-07T20:32:43.2734281Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2734286Z 2025-05-07T20:32:43.2734387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2734661Z self=, 2025-05-07T20:32:43.2734738Z T=2048, 2025-05-07T20:32:43.2734814Z D=7168, 2025-05-07T20:32:43.2734904Z scale_ub=1200.0, 2025-05-07T20:32:43.2734988Z contiguous=True, 2025-05-07T20:32:43.2735069Z compiled=True, 2025-05-07T20:32:43.2735147Z ) 2025-05-07T20:32:43.2735367Z self = 2025-05-07T20:32:43.2735544Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2735548Z 2025-05-07T20:32:43.2735625Z @given( 2025-05-07T20:32:43.2735742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2735843Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2735955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2736072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2736186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2736265Z ) 2025-05-07T20:32:43.2736514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2736610Z def test_silu_mul_quant( 2025-05-07T20:32:43.2736685Z self, 2025-05-07T20:32:43.2736764Z T: int, 2025-05-07T20:32:43.2736839Z D: int, 2025-05-07T20:32:43.2737016Z scale_ub: Optional[float], 2025-05-07T20:32:43.2737108Z contiguous: bool, 2025-05-07T20:32:43.2737193Z compiled: bool, 2025-05-07T20:32:43.2737271Z ) -> None: 2025-05-07T20:32:43.2737368Z torch.manual_seed(2025) 2025-05-07T20:32:43.2737439Z 2025-05-07T20:32:43.2737605Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2737680Z 2025-05-07T20:32:43.2737772Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2737897Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2739647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2739716Z 2025-05-07T20:32:43.2739837Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2739846Z 2025-05-07T20:32:43.2739947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2740469Z self=, 2025-05-07T20:32:43.2740590Z T=2048, 2025-05-07T20:32:43.2740679Z D=7168, 2025-05-07T20:32:43.2740760Z scale_ub=None, 2025-05-07T20:32:43.2740848Z contiguous=True, 2025-05-07T20:32:43.2740931Z compiled=False, 2025-05-07T20:32:43.2741009Z ) 2025-05-07T20:32:43.2741269Z self = 2025-05-07T20:32:43.2741441Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2741446Z 2025-05-07T20:32:43.2741526Z @given( 2025-05-07T20:32:43.2741651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2741750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2741867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2741981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2742092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2742171Z ) 2025-05-07T20:32:43.2742415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2742510Z def test_silu_mul_quant( 2025-05-07T20:32:43.2742591Z self, 2025-05-07T20:32:43.2742667Z T: int, 2025-05-07T20:32:43.2742838Z D: int, 2025-05-07T20:32:43.2742942Z scale_ub: Optional[float], 2025-05-07T20:32:43.2743032Z contiguous: bool, 2025-05-07T20:32:43.2743122Z compiled: bool, 2025-05-07T20:32:43.2743199Z ) -> None: 2025-05-07T20:32:43.2743294Z torch.manual_seed(2025) 2025-05-07T20:32:43.2743373Z 2025-05-07T20:32:43.2743541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2743615Z 2025-05-07T20:32:43.2743713Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2745491Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2745499Z 2025-05-07T20:32:43.2745619Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2745623Z 2025-05-07T20:32:43.2745723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2746057Z self=, 2025-05-07T20:32:43.2746140Z T=1, 2025-05-07T20:32:43.2746215Z D=7168, 2025-05-07T20:32:43.2746302Z scale_ub=1200.0, 2025-05-07T20:32:43.2746386Z contiguous=True, 2025-05-07T20:32:43.2746471Z compiled=False, 2025-05-07T20:32:43.2746547Z ) 2025-05-07T20:32:43.2746763Z self = 2025-05-07T20:32:43.2746930Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2746935Z 2025-05-07T20:32:43.2747017Z @given( 2025-05-07T20:32:43.2747136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2747295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2747414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2747529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2747643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2747722Z ) 2025-05-07T20:32:43.2747972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2748067Z def test_silu_mul_quant( 2025-05-07T20:32:43.2748144Z self, 2025-05-07T20:32:43.2748220Z T: int, 2025-05-07T20:32:43.2748299Z D: int, 2025-05-07T20:32:43.2748395Z scale_ub: Optional[float], 2025-05-07T20:32:43.2748485Z contiguous: bool, 2025-05-07T20:32:43.2748572Z compiled: bool, 2025-05-07T20:32:43.2748649Z ) -> None: 2025-05-07T20:32:43.2748744Z torch.manual_seed(2025) 2025-05-07T20:32:43.2748826Z 2025-05-07T20:32:43.2748996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2749075Z 2025-05-07T20:32:43.2749167Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2749292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2749388Z x = x_sign * x_clamp 2025-05-07T20:32:43.2749469Z x0 = x[:, :D] 2025-05-07T20:32:43.2749554Z x1 = x[:, D:] 2025-05-07T20:32:43.2749630Z 2025-05-07T20:32:43.2749713Z if contiguous: 2025-05-07T20:32:43.2749806Z x0 = x0.contiguous() 2025-05-07T20:32:43.2749900Z x1 = x1.contiguous() 2025-05-07T20:32:43.2749972Z 2025-05-07T20:32:43.2750062Z if scale_ub is not None: 2025-05-07T20:32:43.2750170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2750306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2750381Z ) 2025-05-07T20:32:43.2750463Z else: 2025-05-07T20:32:43.2750560Z scale_ub_tensor = None 2025-05-07T20:32:43.2750687Z 2025-05-07T20:32:43.2750865Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2750990Z op = silu_mul_quant 2025-05-07T20:32:43.2751118Z if compiled: 2025-05-07T20:32:43.2751256Z op = torch.compile(op) 2025-05-07T20:32:43.2751436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2751911Z 2025-05-07T20:32:43.2752170Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2752406Z 2025-05-07T20:32:43.2752557Z moe/activation_test.py:117: 2025-05-07T20:32:43.2752953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2753392Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2753760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2754835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2755931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2756820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2757921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2759096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2759903Z kernel = self.compile( 2025-05-07T20:32:43.2760522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2761174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2761571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2761799Z 2025-05-07T20:32:43.2762016Z self = 2025-05-07T20:32:43.2763120Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2764552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8d50040>} 2025-05-07T20:32:43.2765913Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2766931Z context = 2025-05-07T20:32:43.2772113Z 2025-05-07T20:32:43.2772307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2772841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2773315Z module_map=module_map) 2025-05-07T20:32:43.2773692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2774049Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2774311Z E ^ 2025-05-07T20:32:43.2774771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2775229Z 2025-05-07T20:32:43.2775654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2776171Z 2025-05-07T20:32:43.2776272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2776684Z self=, 2025-05-07T20:32:43.2777083Z T=128, 2025-05-07T20:32:43.2777263Z D=5120, 2025-05-07T20:32:43.2777458Z scale_ub=None, 2025-05-07T20:32:43.2777670Z contiguous=True, 2025-05-07T20:32:43.2777887Z compiled=False, 2025-05-07T20:32:43.2778100Z ) 2025-05-07T20:32:43.2778483Z self = 2025-05-07T20:32:43.2778976Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2779251Z 2025-05-07T20:32:43.2779330Z @given( 2025-05-07T20:32:43.2779560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2779884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2780194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2780531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2780859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2781231Z ) 2025-05-07T20:32:43.2781583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2782030Z def test_silu_mul_quant( 2025-05-07T20:32:43.2782269Z self, 2025-05-07T20:32:43.2782461Z T: int, 2025-05-07T20:32:43.2782659Z D: int, 2025-05-07T20:32:43.2782884Z scale_ub: Optional[float], 2025-05-07T20:32:43.2783159Z contiguous: bool, 2025-05-07T20:32:43.2783396Z compiled: bool, 2025-05-07T20:32:43.2783613Z ) -> None: 2025-05-07T20:32:43.2783827Z torch.manual_seed(2025) 2025-05-07T20:32:43.2784072Z 2025-05-07T20:32:43.2784387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2784772Z 2025-05-07T20:32:43.2784972Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2785259Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2785568Z x = x_sign * x_clamp 2025-05-07T20:32:43.2785811Z x0 = x[:, :D] 2025-05-07T20:32:43.2786032Z x1 = x[:, D:] 2025-05-07T20:32:43.2786237Z 2025-05-07T20:32:43.2786424Z if contiguous: 2025-05-07T20:32:43.2786654Z x0 = x0.contiguous() 2025-05-07T20:32:43.2786911Z x1 = x1.contiguous() 2025-05-07T20:32:43.2787154Z 2025-05-07T20:32:43.2787346Z if scale_ub is not None: 2025-05-07T20:32:43.2787662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2788001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2788310Z ) 2025-05-07T20:32:43.2788500Z else: 2025-05-07T20:32:43.2788708Z scale_ub_tensor = None 2025-05-07T20:32:43.2788967Z 2025-05-07T20:32:43.2789197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2789519Z op = silu_mul_quant 2025-05-07T20:32:43.2789771Z if compiled: 2025-05-07T20:32:43.2790023Z op = torch.compile(op) 2025-05-07T20:32:43.2790321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2790600Z 2025-05-07T20:32:43.2790827Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2791012Z 2025-05-07T20:32:43.2791110Z moe/activation_test.py:117: 2025-05-07T20:32:43.2791410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2791755Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2792032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2792717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2793408Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2793952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2794643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2795309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2795852Z kernel = self.compile( 2025-05-07T20:32:43.2796390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2797051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2797501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2797736Z 2025-05-07T20:32:43.2797950Z self = 2025-05-07T20:32:43.2799039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2800421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8d50a60>} 2025-05-07T20:32:43.2801762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2802793Z context = 2025-05-07T20:32:43.2803089Z 2025-05-07T20:32:43.2803265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2803793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2804297Z module_map=module_map) 2025-05-07T20:32:43.2804724Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2805070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2805325Z E ^ 2025-05-07T20:32:43.2805786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2806231Z 2025-05-07T20:32:43.2806656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2807173Z 2025-05-07T20:32:43.2807278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2807736Z self=, 2025-05-07T20:32:43.2808134Z T=128, 2025-05-07T20:32:43.2808322Z D=7168, 2025-05-07T20:32:43.2808511Z scale_ub=None, 2025-05-07T20:32:43.2808726Z contiguous=True, 2025-05-07T20:32:43.2808946Z compiled=False, 2025-05-07T20:32:43.2809156Z ) 2025-05-07T20:32:43.2809474Z self = 2025-05-07T20:32:43.2809966Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2810235Z 2025-05-07T20:32:43.2810312Z @given( 2025-05-07T20:32:43.2810540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2810859Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2811168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2811498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2811827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2812115Z ) 2025-05-07T20:32:43.2812471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2812925Z def test_silu_mul_quant( 2025-05-07T20:32:43.2813161Z self, 2025-05-07T20:32:43.2813353Z T: int, 2025-05-07T20:32:43.2813550Z D: int, 2025-05-07T20:32:43.2813770Z scale_ub: Optional[float], 2025-05-07T20:32:43.2814041Z contiguous: bool, 2025-05-07T20:32:43.2814281Z compiled: bool, 2025-05-07T20:32:43.2814499Z ) -> None: 2025-05-07T20:32:43.2814711Z torch.manual_seed(2025) 2025-05-07T20:32:43.2814955Z 2025-05-07T20:32:43.2815221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2815565Z 2025-05-07T20:32:43.2815764Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2816050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2816361Z x = x_sign * x_clamp 2025-05-07T20:32:43.2816603Z x0 = x[:, :D] 2025-05-07T20:32:43.2816863Z x1 = x[:, D:] 2025-05-07T20:32:43.2817074Z 2025-05-07T20:32:43.2817260Z if contiguous: 2025-05-07T20:32:43.2817489Z x0 = x0.contiguous() 2025-05-07T20:32:43.2817746Z x1 = x1.contiguous() 2025-05-07T20:32:43.2817985Z 2025-05-07T20:32:43.2818172Z if scale_ub is not None: 2025-05-07T20:32:43.2818446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2818782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2819086Z ) 2025-05-07T20:32:43.2819277Z else: 2025-05-07T20:32:43.2819488Z scale_ub_tensor = None 2025-05-07T20:32:43.2819742Z 2025-05-07T20:32:43.2819967Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2820282Z op = silu_mul_quant 2025-05-07T20:32:43.2820534Z if compiled: 2025-05-07T20:32:43.2820780Z op = torch.compile(op) 2025-05-07T20:32:43.2821135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2821414Z 2025-05-07T20:32:43.2821603Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2821773Z 2025-05-07T20:32:43.2821872Z moe/activation_test.py:117: 2025-05-07T20:32:43.2822164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2822578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2822856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2823545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2824231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2824761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2825441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2826111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2826696Z kernel = self.compile( 2025-05-07T20:32:43.2827231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2827882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2828284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2828515Z 2025-05-07T20:32:43.2828730Z self = 2025-05-07T20:32:43.2829804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2831176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8cd1790>} 2025-05-07T20:32:43.2832521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2833532Z context = 2025-05-07T20:32:43.2833824Z 2025-05-07T20:32:43.2833994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2834515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2834977Z module_map=module_map) 2025-05-07T20:32:43.2835338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2835685Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2835947Z E ^ 2025-05-07T20:32:43.2836452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2836913Z 2025-05-07T20:32:43.2837326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2837840Z 2025-05-07T20:32:43.2837944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2838360Z self=, 2025-05-07T20:32:43.2838761Z T=2048, 2025-05-07T20:32:43.2838943Z D=7168, 2025-05-07T20:32:43.2839133Z scale_ub=1200.0, 2025-05-07T20:32:43.2839359Z contiguous=True, 2025-05-07T20:32:43.2839577Z compiled=False, 2025-05-07T20:32:43.2839777Z ) 2025-05-07T20:32:43.2840464Z self = 2025-05-07T20:32:43.2841116Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2841396Z 2025-05-07T20:32:43.2841473Z @given( 2025-05-07T20:32:43.2841699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2842026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2842332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2842664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2842998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2843386Z ) 2025-05-07T20:32:43.2843793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2844242Z def test_silu_mul_quant( 2025-05-07T20:32:43.2844481Z self, 2025-05-07T20:32:43.2844673Z T: int, 2025-05-07T20:32:43.2844867Z D: int, 2025-05-07T20:32:43.2845088Z scale_ub: Optional[float], 2025-05-07T20:32:43.2845358Z contiguous: bool, 2025-05-07T20:32:43.2845595Z compiled: bool, 2025-05-07T20:32:43.2845816Z ) -> None: 2025-05-07T20:32:43.2846030Z torch.manual_seed(2025) 2025-05-07T20:32:43.2846272Z 2025-05-07T20:32:43.2846542Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2848639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2850515Z 2025-05-07T20:32:43.2850637Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2850855Z 2025-05-07T20:32:43.2850956Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2851373Z self=, 2025-05-07T20:32:43.2851768Z T=1, 2025-05-07T20:32:43.2851952Z D=5120, 2025-05-07T20:32:43.2852143Z scale_ub=1200.0, 2025-05-07T20:32:43.2852359Z contiguous=True, 2025-05-07T20:32:43.2852582Z compiled=False, 2025-05-07T20:32:43.2852785Z ) 2025-05-07T20:32:43.2853095Z self = 2025-05-07T20:32:43.2853586Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2853852Z 2025-05-07T20:32:43.2853932Z @given( 2025-05-07T20:32:43.2854152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2854467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2854770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2855099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2855423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2855710Z ) 2025-05-07T20:32:43.2856058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2856568Z def test_silu_mul_quant( 2025-05-07T20:32:43.2856817Z self, 2025-05-07T20:32:43.2857013Z T: int, 2025-05-07T20:32:43.2857206Z D: int, 2025-05-07T20:32:43.2857421Z scale_ub: Optional[float], 2025-05-07T20:32:43.2857696Z contiguous: bool, 2025-05-07T20:32:43.2857934Z compiled: bool, 2025-05-07T20:32:43.2858155Z ) -> None: 2025-05-07T20:32:43.2858367Z torch.manual_seed(2025) 2025-05-07T20:32:43.2858604Z 2025-05-07T20:32:43.2858870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2859208Z 2025-05-07T20:32:43.2859400Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2859686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2859995Z x = x_sign * x_clamp 2025-05-07T20:32:43.2860236Z x0 = x[:, :D] 2025-05-07T20:32:43.2860447Z x1 = x[:, D:] 2025-05-07T20:32:43.2860651Z 2025-05-07T20:32:43.2860834Z if contiguous: 2025-05-07T20:32:43.2861135Z x0 = x0.contiguous() 2025-05-07T20:32:43.2861394Z x1 = x1.contiguous() 2025-05-07T20:32:43.2861634Z 2025-05-07T20:32:43.2861820Z if scale_ub is not None: 2025-05-07T20:32:43.2862092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2862478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2862822Z ) 2025-05-07T20:32:43.2863015Z else: 2025-05-07T20:32:43.2863221Z scale_ub_tensor = None 2025-05-07T20:32:43.2863468Z 2025-05-07T20:32:43.2863700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2864014Z op = silu_mul_quant 2025-05-07T20:32:43.2864268Z if compiled: 2025-05-07T20:32:43.2864510Z op = torch.compile(op) 2025-05-07T20:32:43.2864805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2865081Z 2025-05-07T20:32:43.2865267Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2865477Z 2025-05-07T20:32:43.2865579Z moe/activation_test.py:117: 2025-05-07T20:32:43.2865876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2866206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2866487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2867180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2867868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2868401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2869077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2869737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2870259Z kernel = self.compile( 2025-05-07T20:32:43.2870802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2871454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2871851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2872084Z 2025-05-07T20:32:43.2872293Z self = 2025-05-07T20:32:43.2873366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2874727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8c96040>} 2025-05-07T20:32:43.2876104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2877132Z context = 2025-05-07T20:32:43.2877418Z 2025-05-07T20:32:43.2877588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2878124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2878589Z module_map=module_map) 2025-05-07T20:32:43.2878948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2879298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2879557Z E ^ 2025-05-07T20:32:43.2880021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2880474Z 2025-05-07T20:32:43.2880892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2881421Z 2025-05-07T20:32:43.2881525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2881938Z self=, 2025-05-07T20:32:43.2882331Z T=2048, 2025-05-07T20:32:43.2882594Z D=5120, 2025-05-07T20:32:43.2882826Z scale_ub=None, 2025-05-07T20:32:43.2883035Z contiguous=True, 2025-05-07T20:32:43.2883260Z compiled=False, 2025-05-07T20:32:43.2883461Z ) 2025-05-07T20:32:43.2883777Z self = 2025-05-07T20:32:43.2884269Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2884539Z 2025-05-07T20:32:43.2884616Z @given( 2025-05-07T20:32:43.2884841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2885149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2885458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2885832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2886158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2886443Z ) 2025-05-07T20:32:43.2886794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2887241Z def test_silu_mul_quant( 2025-05-07T20:32:43.2887478Z self, 2025-05-07T20:32:43.2887671Z T: int, 2025-05-07T20:32:43.2887867Z D: int, 2025-05-07T20:32:43.2888083Z scale_ub: Optional[float], 2025-05-07T20:32:43.2888352Z contiguous: bool, 2025-05-07T20:32:43.2888590Z compiled: bool, 2025-05-07T20:32:43.2888806Z ) -> None: 2025-05-07T20:32:43.2889018Z torch.manual_seed(2025) 2025-05-07T20:32:43.2889260Z 2025-05-07T20:32:43.2889531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2889872Z 2025-05-07T20:32:43.2890063Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2891999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2893831Z 2025-05-07T20:32:43.2893956Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2894171Z 2025-05-07T20:32:43.2894272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2894683Z self=, 2025-05-07T20:32:43.2895083Z T=16384, 2025-05-07T20:32:43.2895269Z D=5120, 2025-05-07T20:32:43.2895507Z scale_ub=None, 2025-05-07T20:32:43.2895717Z contiguous=True, 2025-05-07T20:32:43.2895935Z compiled=False, 2025-05-07T20:32:43.2896139Z ) 2025-05-07T20:32:43.2896454Z self = 2025-05-07T20:32:43.2896948Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2897231Z 2025-05-07T20:32:43.2897307Z @given( 2025-05-07T20:32:43.2897534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2897845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2898143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2898468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2898798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2899078Z ) 2025-05-07T20:32:43.2899428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2899875Z def test_silu_mul_quant( 2025-05-07T20:32:43.2900124Z self, 2025-05-07T20:32:43.2900319Z T: int, 2025-05-07T20:32:43.2900514Z D: int, 2025-05-07T20:32:43.2900727Z scale_ub: Optional[float], 2025-05-07T20:32:43.2900996Z contiguous: bool, 2025-05-07T20:32:43.2901315Z compiled: bool, 2025-05-07T20:32:43.2901579Z ) -> None: 2025-05-07T20:32:43.2901832Z torch.manual_seed(2025) 2025-05-07T20:32:43.2902077Z 2025-05-07T20:32:43.2902348Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2904396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2906309Z 2025-05-07T20:32:43.2906428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2906648Z 2025-05-07T20:32:43.2906749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2907161Z self=, 2025-05-07T20:32:43.2907554Z T=4096, 2025-05-07T20:32:43.2907739Z D=5120, 2025-05-07T20:32:43.2907925Z scale_ub=None, 2025-05-07T20:32:43.2908138Z contiguous=True, 2025-05-07T20:32:43.2908356Z compiled=False, 2025-05-07T20:32:43.2908558Z ) 2025-05-07T20:32:43.2908868Z self = 2025-05-07T20:32:43.2909357Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2909631Z 2025-05-07T20:32:43.2909706Z @given( 2025-05-07T20:32:43.2909933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2910244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2910547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2910901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2911248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2911541Z ) 2025-05-07T20:32:43.2911884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2912321Z def test_silu_mul_quant( 2025-05-07T20:32:43.2912557Z self, 2025-05-07T20:32:43.2912746Z T: int, 2025-05-07T20:32:43.2912939Z D: int, 2025-05-07T20:32:43.2913152Z scale_ub: Optional[float], 2025-05-07T20:32:43.2913419Z contiguous: bool, 2025-05-07T20:32:43.2913653Z compiled: bool, 2025-05-07T20:32:43.2913868Z ) -> None: 2025-05-07T20:32:43.2914080Z torch.manual_seed(2025) 2025-05-07T20:32:43.2914318Z 2025-05-07T20:32:43.2914632Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2916682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2918546Z 2025-05-07T20:32:43.2918662Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2918876Z 2025-05-07T20:32:43.2918976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2919387Z self=, 2025-05-07T20:32:43.2919789Z T=2048, 2025-05-07T20:32:43.2925234Z D=5120, 2025-05-07T20:32:43.2925454Z scale_ub=None, 2025-05-07T20:32:43.2925676Z contiguous=False, 2025-05-07T20:32:43.2925904Z compiled=False, 2025-05-07T20:32:43.2926108Z ) 2025-05-07T20:32:43.2926422Z self = 2025-05-07T20:32:43.2927029Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2927312Z 2025-05-07T20:32:43.2927390Z @given( 2025-05-07T20:32:43.2927620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2927933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2928233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2928564Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2928889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2929167Z ) 2025-05-07T20:32:43.2929515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2930006Z def test_silu_mul_quant( 2025-05-07T20:32:43.2930246Z self, 2025-05-07T20:32:43.2930432Z T: int, 2025-05-07T20:32:43.2930630Z D: int, 2025-05-07T20:32:43.2930847Z scale_ub: Optional[float], 2025-05-07T20:32:43.2931120Z contiguous: bool, 2025-05-07T20:32:43.2931367Z compiled: bool, 2025-05-07T20:32:43.2931591Z ) -> None: 2025-05-07T20:32:43.2931802Z torch.manual_seed(2025) 2025-05-07T20:32:43.2932041Z 2025-05-07T20:32:43.2932311Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2934366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2936225Z 2025-05-07T20:32:43.2936347Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2936566Z 2025-05-07T20:32:43.2936671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2937090Z self=, 2025-05-07T20:32:43.2937488Z T=4096, 2025-05-07T20:32:43.2937669Z D=7168, 2025-05-07T20:32:43.2937854Z scale_ub=None, 2025-05-07T20:32:43.2938063Z contiguous=True, 2025-05-07T20:32:43.2938280Z compiled=True, 2025-05-07T20:32:43.2938481Z ) 2025-05-07T20:32:43.2938798Z self = 2025-05-07T20:32:43.2939287Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2939560Z 2025-05-07T20:32:43.2939685Z @given( 2025-05-07T20:32:43.2939914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2940500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2940805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2941185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2941529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2941813Z ) 2025-05-07T20:32:43.2942163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2942603Z def test_silu_mul_quant( 2025-05-07T20:32:43.2942839Z self, 2025-05-07T20:32:43.2943028Z T: int, 2025-05-07T20:32:43.2943223Z D: int, 2025-05-07T20:32:43.2943436Z scale_ub: Optional[float], 2025-05-07T20:32:43.2943706Z contiguous: bool, 2025-05-07T20:32:43.2943943Z compiled: bool, 2025-05-07T20:32:43.2944161Z ) -> None: 2025-05-07T20:32:43.2944380Z torch.manual_seed(2025) 2025-05-07T20:32:43.2944626Z 2025-05-07T20:32:43.2944894Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2947017Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
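Note that the hint at the end of each message is conditional: expandable_segments is suggested only "if reserved but unallocated memory is large". Here it is not (19.12 MiB reserved-but-unallocated against 21.73 GiB allocated), so fragmentation is not what is killing these examples; the pool is simply full. If one did want to try the knob anyway, it must be in the environment before the first CUDA allocation in the process:

    import os

    # Must happen before torch initializes the CUDA caching allocator.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # deliberately imported after the env var is set
    x = torch.randn(8, device="cuda")  # allocator now uses expandable segments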
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2948932Z 2025-05-07T20:32:43.2949055Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2949270Z 2025-05-07T20:32:43.2949372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2949788Z self=, 2025-05-07T20:32:43.2950239Z T=2048, 2025-05-07T20:32:43.2950425Z D=5120, 2025-05-07T20:32:43.2950615Z scale_ub=1200.0, 2025-05-07T20:32:43.2950836Z contiguous=False, 2025-05-07T20:32:43.2951056Z compiled=False, 2025-05-07T20:32:43.2951258Z ) 2025-05-07T20:32:43.2951576Z self = 2025-05-07T20:32:43.2952068Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2952343Z 2025-05-07T20:32:43.2952418Z @given( 2025-05-07T20:32:43.2952645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2952952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2953254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2953584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2953908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2954195Z ) 2025-05-07T20:32:43.2954544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2954988Z def test_silu_mul_quant( 2025-05-07T20:32:43.2955225Z self, 2025-05-07T20:32:43.2955419Z T: int, 2025-05-07T20:32:43.2955614Z D: int, 2025-05-07T20:32:43.2955829Z scale_ub: Optional[float], 2025-05-07T20:32:43.2956108Z contiguous: bool, 2025-05-07T20:32:43.2956345Z compiled: bool, 2025-05-07T20:32:43.2956560Z ) -> None: 2025-05-07T20:32:43.2956773Z torch.manual_seed(2025) 2025-05-07T20:32:43.2957014Z 2025-05-07T20:32:43.2957281Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2959353Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2961210Z 2025-05-07T20:32:43.2961333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2961554Z 2025-05-07T20:32:43.2961656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2962068Z self=, 2025-05-07T20:32:43.2962466Z T=4096, 2025-05-07T20:32:43.2962651Z D=7168, 2025-05-07T20:32:43.2962842Z scale_ub=1200.0, 2025-05-07T20:32:43.2963058Z contiguous=True, 2025-05-07T20:32:43.2963276Z compiled=False, 2025-05-07T20:32:43.2963480Z ) 2025-05-07T20:32:43.2963789Z self = 2025-05-07T20:32:43.2964287Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2964569Z 2025-05-07T20:32:43.2964647Z @given( 2025-05-07T20:32:43.2964868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2965180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2965486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2965683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2965805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2965879Z ) 2025-05-07T20:32:43.2966123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2966222Z def test_silu_mul_quant( 2025-05-07T20:32:43.2966300Z self, 2025-05-07T20:32:43.2966380Z T: int, 2025-05-07T20:32:43.2966456Z D: int, 2025-05-07T20:32:43.2966553Z scale_ub: Optional[float], 2025-05-07T20:32:43.2966645Z contiguous: bool, 2025-05-07T20:32:43.2966732Z compiled: bool, 2025-05-07T20:32:43.2966856Z ) -> None: 2025-05-07T20:32:43.2966949Z torch.manual_seed(2025) 2025-05-07T20:32:43.2967022Z 2025-05-07T20:32:43.2967193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2968941Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
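The compiled flag only changes how the op is invoked, not which kernel ultimately runs: further down the log, compiled=False examples fail inside _fbgemm_silu_mul_quant directly, while compiled=True examples reach the same Triton compile via torch/_dynamo/eval_frame.py. A minimal sketch of the toggle the test applies (op stands in for the imported FBGEMM silu_mul_quant):

    import torch

    def run_op(op, *args, compiled: bool = False):
        # torch.compile wraps the callable; the underlying Triton kernel
        # is still compiled for the current GPU either way.
        fn = torch.compile(op) if compiled else op
        return fn(*args)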
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2968950Z 2025-05-07T20:32:43.2969070Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2969074Z 2025-05-07T20:32:43.2969180Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2969403Z self=, 2025-05-07T20:32:43.2969487Z T=16384, 2025-05-07T20:32:43.2969567Z D=7168, 2025-05-07T20:32:43.2969648Z scale_ub=None, 2025-05-07T20:32:43.2969739Z contiguous=False, 2025-05-07T20:32:43.2969825Z compiled=True, 2025-05-07T20:32:43.2969904Z ) 2025-05-07T20:32:43.2970120Z self = 2025-05-07T20:32:43.2970295Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2970299Z 2025-05-07T20:32:43.2970382Z @given( 2025-05-07T20:32:43.2970500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2970599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2970717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2970832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2970987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2971069Z ) 2025-05-07T20:32:43.2971312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2971408Z def test_silu_mul_quant( 2025-05-07T20:32:43.2971485Z self, 2025-05-07T20:32:43.2971563Z T: int, 2025-05-07T20:32:43.2971651Z D: int, 2025-05-07T20:32:43.2971749Z scale_ub: Optional[float], 2025-05-07T20:32:43.2971837Z contiguous: bool, 2025-05-07T20:32:43.2971929Z compiled: bool, 2025-05-07T20:32:43.2972009Z ) -> None: 2025-05-07T20:32:43.2972104Z torch.manual_seed(2025) 2025-05-07T20:32:43.2972181Z 2025-05-07T20:32:43.2972347Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2974142Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
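For reference, the computation these examples try to exercise (its full source appears in the longer listings below) is SiLU(x0) * x1 followed by rowwise FP8 quantization. A minimal eager-mode sketch, assuming a per-row absmax scale and torch.float8_e4m3fn (PyTorch >= 2.1); the test's actual reference delegates to triton_quantize_fp8_row, whose exact clamping details may differ:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        fp8: torch.dtype = torch.float8_e4m3fn,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)                # one scale per row
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / torch.finfo(fp8).max
        return (y / scale[:, None]).to(fp8), scale   # dequant: y_fp8 * scale[:, None]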
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2974187Z 2025-05-07T20:32:43.2974306Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2974311Z 2025-05-07T20:32:43.2974416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2974636Z self=, 2025-05-07T20:32:43.2974713Z T=4096, 2025-05-07T20:32:43.2974790Z D=7168, 2025-05-07T20:32:43.2974871Z scale_ub=None, 2025-05-07T20:32:43.2974954Z contiguous=True, 2025-05-07T20:32:43.2975040Z compiled=False, 2025-05-07T20:32:43.2975115Z ) 2025-05-07T20:32:43.2975333Z self = 2025-05-07T20:32:43.2975578Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2975582Z 2025-05-07T20:32:43.2975660Z @given( 2025-05-07T20:32:43.2975780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2975878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2975998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2976117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2976229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2976302Z ) 2025-05-07T20:32:43.2976554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2976648Z def test_silu_mul_quant( 2025-05-07T20:32:43.2976724Z self, 2025-05-07T20:32:43.2976803Z T: int, 2025-05-07T20:32:43.2976878Z D: int, 2025-05-07T20:32:43.2976974Z scale_ub: Optional[float], 2025-05-07T20:32:43.2977072Z contiguous: bool, 2025-05-07T20:32:43.2977158Z compiled: bool, 2025-05-07T20:32:43.2977238Z ) -> None: 2025-05-07T20:32:43.2977332Z torch.manual_seed(2025) 2025-05-07T20:32:43.2977404Z 2025-05-07T20:32:43.2977576Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2979358Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2979365Z 2025-05-07T20:32:43.2979485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2979537Z 2025-05-07T20:32:43.2979642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2979862Z self=, 2025-05-07T20:32:43.2979945Z T=16384, 2025-05-07T20:32:43.2980024Z D=7168, 2025-05-07T20:32:43.2980108Z scale_ub=None, 2025-05-07T20:32:43.2980199Z contiguous=True, 2025-05-07T20:32:43.2980283Z compiled=False, 2025-05-07T20:32:43.2980358Z ) 2025-05-07T20:32:43.2980577Z self = 2025-05-07T20:32:43.2980768Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2980774Z 2025-05-07T20:32:43.2980862Z @given( 2025-05-07T20:32:43.2981003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2981147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2981264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2981383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2981505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2981581Z ) 2025-05-07T20:32:43.2981831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2981928Z def test_silu_mul_quant( 2025-05-07T20:32:43.2982081Z self, 2025-05-07T20:32:43.2982160Z T: int, 2025-05-07T20:32:43.2982239Z D: int, 2025-05-07T20:32:43.2982336Z scale_ub: Optional[float], 2025-05-07T20:32:43.2982425Z contiguous: bool, 2025-05-07T20:32:43.2982514Z compiled: bool, 2025-05-07T20:32:43.2982591Z ) -> None: 2025-05-07T20:32:43.2982686Z torch.manual_seed(2025) 2025-05-07T20:32:43.2982762Z 2025-05-07T20:32:43.2982930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2984685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2984732Z 2025-05-07T20:32:43.2984849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2984854Z 2025-05-07T20:32:43.2984963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2985184Z self=, 2025-05-07T20:32:43.2985261Z T=16384, 2025-05-07T20:32:43.2985343Z D=7168, 2025-05-07T20:32:43.2985426Z scale_ub=1200.0, 2025-05-07T20:32:43.2985509Z contiguous=True, 2025-05-07T20:32:43.2985594Z compiled=False, 2025-05-07T20:32:43.2985668Z ) 2025-05-07T20:32:43.2985884Z self = 2025-05-07T20:32:43.2986067Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2986072Z 2025-05-07T20:32:43.2986149Z @given( 2025-05-07T20:32:43.2986271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2986374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2986489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2986608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2986723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2986795Z ) 2025-05-07T20:32:43.2987046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2987138Z def test_silu_mul_quant( 2025-05-07T20:32:43.2987214Z self, 2025-05-07T20:32:43.2987294Z T: int, 2025-05-07T20:32:43.2987371Z D: int, 2025-05-07T20:32:43.2987512Z scale_ub: Optional[float], 2025-05-07T20:32:43.2987609Z contiguous: bool, 2025-05-07T20:32:43.2987694Z compiled: bool, 2025-05-07T20:32:43.2987774Z ) -> None: 2025-05-07T20:32:43.2987869Z torch.manual_seed(2025) 2025-05-07T20:32:43.2987941Z 2025-05-07T20:32:43.2988116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2989897Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
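One detail worth noticing across these blocks: the "allocated by PyTorch" figure creeps upward between examples (21.73 GiB here, 21.74 GiB and then 21.77 GiB further down), i.e. tensors from earlier Hypothesis draws are still alive when the next draw starts. A blunt between-example cleanup, sketched below, is one way to rule that out; this is an assumption about a fix, not something the suite currently does:

    import gc
    import torch

    def free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

    # e.g. call free_cuda() at the top of test_silu_mul_quant.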
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2989906Z 2025-05-07T20:32:43.2990029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2990034Z 2025-05-07T20:32:43.2990137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2990358Z self=, 2025-05-07T20:32:43.2990441Z T=128, 2025-05-07T20:32:43.2990590Z D=5120, 2025-05-07T20:32:43.2990673Z scale_ub=1200.0, 2025-05-07T20:32:43.2990761Z contiguous=False, 2025-05-07T20:32:43.2990844Z compiled=False, 2025-05-07T20:32:43.2990919Z ) 2025-05-07T20:32:43.2991133Z self = 2025-05-07T20:32:43.2991308Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2991313Z 2025-05-07T20:32:43.2991395Z @given( 2025-05-07T20:32:43.2991513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2991613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2991773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2991896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2992009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2992085Z ) 2025-05-07T20:32:43.2992333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2992435Z def test_silu_mul_quant( 2025-05-07T20:32:43.2992515Z self, 2025-05-07T20:32:43.2992592Z T: int, 2025-05-07T20:32:43.2992669Z D: int, 2025-05-07T20:32:43.2992769Z scale_ub: Optional[float], 2025-05-07T20:32:43.2992858Z contiguous: bool, 2025-05-07T20:32:43.2992945Z compiled: bool, 2025-05-07T20:32:43.2993024Z ) -> None: 2025-05-07T20:32:43.2993120Z torch.manual_seed(2025) 2025-05-07T20:32:43.2993196Z 2025-05-07T20:32:43.2993364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2993438Z 2025-05-07T20:32:43.2993540Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2993665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2993754Z x = x_sign * x_clamp 2025-05-07T20:32:43.2993838Z x0 = x[:, :D] 2025-05-07T20:32:43.2993918Z x1 = x[:, D:] 2025-05-07T20:32:43.2993989Z 2025-05-07T20:32:43.2994079Z if contiguous: 2025-05-07T20:32:43.2994173Z x0 = x0.contiguous() 2025-05-07T20:32:43.2994267Z x1 = x1.contiguous() 2025-05-07T20:32:43.2994342Z 2025-05-07T20:32:43.2994431Z if scale_ub is not None: 2025-05-07T20:32:43.2994540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2994677Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2994754Z ) 2025-05-07T20:32:43.2994835Z else: 2025-05-07T20:32:43.2994929Z scale_ub_tensor = None 2025-05-07T20:32:43.2995003Z 2025-05-07T20:32:43.2995136Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2995272Z op = silu_mul_quant 2025-05-07T20:32:43.2995359Z if compiled: 2025-05-07T20:32:43.2995463Z op = torch.compile(op) 2025-05-07T20:32:43.2995567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2995638Z 2025-05-07T20:32:43.2995731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2995740Z 2025-05-07T20:32:43.2995837Z moe/activation_test.py:117: 2025-05-07T20:32:43.2995966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2996066Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2996166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2996674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2996771Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2997127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2997361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2997705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2997804Z kernel = self.compile( 2025-05-07T20:32:43.2998261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2998440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2998571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2998575Z 2025-05-07T20:32:43.2998782Z self = 2025-05-07T20:32:43.2999557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3000110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8a46ca0>} 2025-05-07T20:32:43.3000852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3001049Z context = 2025-05-07T20:32:43.3001053Z 2025-05-07T20:32:43.3001218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3001484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3001591Z module_map=module_map) 2025-05-07T20:32:43.3001752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3001858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3001935Z E ^ 2025-05-07T20:32:43.3002291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3002296Z 2025-05-07T20:32:43.3002708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3002715Z 2025-05-07T20:32:43.3002818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3003045Z self=, 2025-05-07T20:32:43.3003122Z T=2048, 2025-05-07T20:32:43.3003202Z D=7168, 2025-05-07T20:32:43.3003287Z scale_ub=None, 2025-05-07T20:32:43.3003375Z contiguous=False, 2025-05-07T20:32:43.3003462Z compiled=False, 2025-05-07T20:32:43.3003535Z ) 2025-05-07T20:32:43.3003749Z self = 2025-05-07T20:32:43.3003969Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.3003975Z 2025-05-07T20:32:43.3004053Z @given( 2025-05-07T20:32:43.3004170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3004271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3004390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3004507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3004622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3004695Z ) 2025-05-07T20:32:43.3004943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3005035Z def test_silu_mul_quant( 2025-05-07T20:32:43.3005110Z self, 2025-05-07T20:32:43.3005188Z T: int, 2025-05-07T20:32:43.3005263Z D: int, 2025-05-07T20:32:43.3005359Z scale_ub: Optional[float], 2025-05-07T20:32:43.3005449Z contiguous: bool, 2025-05-07T20:32:43.3005542Z compiled: bool, 2025-05-07T20:32:43.3005619Z ) -> None: 2025-05-07T20:32:43.3005717Z torch.manual_seed(2025) 2025-05-07T20:32:43.3005789Z 2025-05-07T20:32:43.3005956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3007755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
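This is the run's other failure mode: not memory, but architecture. Triton refuses to lower fp8e4nv (the e4m3 format) and reports only fp8e4b15 and fp8e5 as available, which indicates a GPU without native FP8 e4m3 support. A quick probe; the sm_89 threshold is an assumption about where native e4m3 begins (Ada and newer):

    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")
    # Assumption: fp8e4nv lowers natively only on sm_89+; on older parts
    # Triton raises exactly the ValueError seen above.
    print("fp8e4nv expected to work:", (major, minor) >= (8, 9))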
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3007824Z 2025-05-07T20:32:43.3007944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.3007952Z 2025-05-07T20:32:43.3008098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3008318Z self=, 2025-05-07T20:32:43.3008398Z T=128, 2025-05-07T20:32:43.3008473Z D=7168, 2025-05-07T20:32:43.3008554Z scale_ub=1200.0, 2025-05-07T20:32:43.3008640Z contiguous=True, 2025-05-07T20:32:43.3008729Z compiled=True, 2025-05-07T20:32:43.3008801Z ) 2025-05-07T20:32:43.3009021Z self = 2025-05-07T20:32:43.3009189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3009194Z 2025-05-07T20:32:43.3009269Z @given( 2025-05-07T20:32:43.3009391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3009489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3009605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3009720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3009841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3009921Z ) 2025-05-07T20:32:43.3010170Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3010262Z def test_silu_mul_quant( 2025-05-07T20:32:43.3010339Z self, 2025-05-07T20:32:43.3010420Z T: int, 2025-05-07T20:32:43.3010497Z D: int, 2025-05-07T20:32:43.3010601Z scale_ub: Optional[float], 2025-05-07T20:32:43.3010689Z contiguous: bool, 2025-05-07T20:32:43.3010789Z compiled: bool, 2025-05-07T20:32:43.3010882Z ) -> None: 2025-05-07T20:32:43.3010990Z torch.manual_seed(2025) 2025-05-07T20:32:43.3011077Z 2025-05-07T20:32:43.3011242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3011316Z 2025-05-07T20:32:43.3011409Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3011532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3011686Z x = x_sign * x_clamp 2025-05-07T20:32:43.3011771Z x0 = x[:, :D] 2025-05-07T20:32:43.3011850Z x1 = x[:, D:] 2025-05-07T20:32:43.3011921Z 2025-05-07T20:32:43.3012008Z if contiguous: 2025-05-07T20:32:43.3012100Z x0 = x0.contiguous() 2025-05-07T20:32:43.3012190Z x1 = x1.contiguous() 2025-05-07T20:32:43.3012272Z 2025-05-07T20:32:43.3012362Z if scale_ub is not None: 2025-05-07T20:32:43.3012469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3012605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.3012682Z ) 2025-05-07T20:32:43.3012761Z else: 2025-05-07T20:32:43.3012858Z scale_ub_tensor = None 2025-05-07T20:32:43.3012930Z 2025-05-07T20:32:43.3013065Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3013154Z op = silu_mul_quant 2025-05-07T20:32:43.3013239Z if compiled: 2025-05-07T20:32:43.3013349Z op = torch.compile(op) 2025-05-07T20:32:43.3013454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3013525Z 2025-05-07T20:32:43.3013617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3013622Z 2025-05-07T20:32:43.3013719Z moe/activation_test.py:117: 2025-05-07T20:32:43.3013891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3014029Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3014129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3014501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.3014593Z return fn(*args, **kwargs) 2025-05-07T20:32:43.3015083Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3015185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3015543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3015812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3016146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3016246Z kernel = self.compile( 2025-05-07T20:32:43.3016633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.3016809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.3016938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3016943Z 2025-05-07T20:32:43.3017151Z self = 2025-05-07T20:32:43.3017921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3018437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd89350d0>} 2025-05-07T20:32:43.3019191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3019387Z context = 2025-05-07T20:32:43.3019392Z 2025-05-07T20:32:43.3019559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3019821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3019930Z module_map=module_map) 2025-05-07T20:32:43.3020136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3020242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3020318Z E ^ 2025-05-07T20:32:43.3020669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3020677Z 2025-05-07T20:32:43.3021152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3021157Z 2025-05-07T20:32:43.3021258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3021482Z self=, 2025-05-07T20:32:43.3021558Z T=128, 2025-05-07T20:32:43.3021633Z D=7168, 2025-05-07T20:32:43.3021723Z scale_ub=1200.0, 2025-05-07T20:32:43.3021807Z contiguous=True, 2025-05-07T20:32:43.3021892Z compiled=False, 2025-05-07T20:32:43.3021967Z ) 2025-05-07T20:32:43.3022186Z self = 2025-05-07T20:32:43.3022361Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.3022365Z 2025-05-07T20:32:43.3022446Z @given( 2025-05-07T20:32:43.3022562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3022737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3022856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3022971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3023090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3023164Z ) 2025-05-07T20:32:43.3023410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3023505Z def test_silu_mul_quant( 2025-05-07T20:32:43.3023581Z self, 2025-05-07T20:32:43.3023656Z T: int, 2025-05-07T20:32:43.3023735Z D: int, 2025-05-07T20:32:43.3023833Z scale_ub: Optional[float], 2025-05-07T20:32:43.3023966Z contiguous: bool, 2025-05-07T20:32:43.3024053Z compiled: bool, 2025-05-07T20:32:43.3024130Z ) -> None: 2025-05-07T20:32:43.3024230Z torch.manual_seed(2025) 2025-05-07T20:32:43.3024302Z 2025-05-07T20:32:43.3024469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3024551Z 2025-05-07T20:32:43.3024642Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3024769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3026520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
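In this block and the next, the OOM has moved from line 92 (the randn) to line 95 (the clamp): the input now fits, but torch.sign, torch.abs, and torch.clamp each materialize another [T, 2*D] temporary. If the goal were only to trim peak memory in this preprocessing, in-place variants would do the same math with one temporary instead of three; a sketch, not the test's code:

    import torch

    def clamp_magnitude_(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to sign(x) * clamp(abs(x), 0.01, 2.0) with fewer temporaries.
        x_sign = torch.sign(x)            # the one temporary we keep
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)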
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3026528Z 2025-05-07T20:32:43.3026645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.3026650Z 2025-05-07T20:32:43.3026756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3026983Z self=, 2025-05-07T20:32:43.3027061Z T=128, 2025-05-07T20:32:43.3027140Z D=5120, 2025-05-07T20:32:43.3027221Z scale_ub=1200.0, 2025-05-07T20:32:43.3027304Z contiguous=True, 2025-05-07T20:32:43.3027390Z compiled=True, 2025-05-07T20:32:43.3027461Z ) 2025-05-07T20:32:43.3027677Z self = 2025-05-07T20:32:43.3027846Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3027851Z 2025-05-07T20:32:43.3027928Z @given( 2025-05-07T20:32:43.3028047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3028189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3028306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3028423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3028536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3028616Z ) 2025-05-07T20:32:43.3028862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3028956Z def test_silu_mul_quant( 2025-05-07T20:32:43.3029034Z self, 2025-05-07T20:32:43.3029111Z T: int, 2025-05-07T20:32:43.3029186Z D: int, 2025-05-07T20:32:43.3029286Z scale_ub: Optional[float], 2025-05-07T20:32:43.3029375Z contiguous: bool, 2025-05-07T20:32:43.3029459Z compiled: bool, 2025-05-07T20:32:43.3029539Z ) -> None: 2025-05-07T20:32:43.3029632Z torch.manual_seed(2025) 2025-05-07T20:32:43.3029703Z 2025-05-07T20:32:43.3029875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3029953Z 2025-05-07T20:32:43.3030044Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3030171Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3031952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3031996Z 2025-05-07T20:32:43.3032117Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.3032121Z 2025-05-07T20:32:43.3032224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3032493Z self=, 2025-05-07T20:32:43.3032571Z T=128, 2025-05-07T20:32:43.3032646Z D=7168, 2025-05-07T20:32:43.3032729Z scale_ub=None, 2025-05-07T20:32:43.3032813Z contiguous=True, 2025-05-07T20:32:43.3032894Z compiled=True, 2025-05-07T20:32:43.3032972Z ) 2025-05-07T20:32:43.3033191Z self = 2025-05-07T20:32:43.3033360Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.3033365Z 2025-05-07T20:32:43.3033440Z @given( 2025-05-07T20:32:43.3033555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3033658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3033770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3033885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3034001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3034079Z ) 2025-05-07T20:32:43.3034327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3034421Z def test_silu_mul_quant( 2025-05-07T20:32:43.3034498Z self, 2025-05-07T20:32:43.3034576Z T: int, 2025-05-07T20:32:43.3034657Z D: int, 2025-05-07T20:32:43.3034757Z scale_ub: Optional[float], 2025-05-07T20:32:43.3034850Z contiguous: bool, 2025-05-07T20:32:43.3034934Z compiled: bool, 2025-05-07T20:32:43.3035012Z ) -> None: 2025-05-07T20:32:43.3035111Z torch.manual_seed(2025) 2025-05-07T20:32:43.3035185Z 2025-05-07T20:32:43.3035353Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3037137Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3037151Z 2025-05-07T20:32:43.3037270Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.3037409Z =============================== warnings summary =============================== 2025-05-07T20:32:43.3037712Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3038017Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3038311Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3039177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:43.3039412Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:43.3039513Z 2025-05-07T20:32:43.3039727Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:43.3039896Z ================= 1 failed, 1 deselected, 3 warnings in 19.45s ================= 2025-05-07T20:32:44.8487177Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.9116742Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:44.9117078Z 2025-05-07T20:32:46.9135965Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:49.0826173Z ============================= test session starts ============================== 2025-05-07T20:32:49.0826895Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:49.0827441Z cachedir: .pytest_cache 2025-05-07T20:32:49.0828044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:49.0828788Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:49.0829224Z plugins: hypothesis-6.131.14 2025-05-07T20:32:50.6956144Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.9071847Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.9072943Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.9073267Z 2025-05-07T20:32:53.6352094Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6353055Z self=, 2025-05-07T20:32:53.6353613Z T=1, 2025-05-07T20:32:53.6353883Z D=5120, 2025-05-07T20:32:53.6354145Z scale_ub=None, 2025-05-07T20:32:53.6354384Z contiguous=True, 2025-05-07T20:32:53.6361994Z compiled=True, 2025-05-07T20:32:53.6362263Z ) 2025-05-07T20:32:53.6362641Z self = 2025-05-07T20:32:53.6363207Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.6363473Z 2025-05-07T20:32:53.6363558Z @given( 2025-05-07T20:32:53.6363807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6364139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6364753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6365115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6365463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6365768Z ) 2025-05-07T20:32:53.6366130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6366607Z def test_silu_mul_quant( 2025-05-07T20:32:53.6366869Z self, 2025-05-07T20:32:53.6367074Z T: int, 2025-05-07T20:32:53.6367291Z D: int, 2025-05-07T20:32:53.6367526Z scale_ub: Optional[float], 2025-05-07T20:32:53.6367806Z contiguous: bool, 2025-05-07T20:32:53.6368057Z compiled: bool, 2025-05-07T20:32:53.6368307Z ) -> None: 2025-05-07T20:32:53.6368532Z torch.manual_seed(2025) 2025-05-07T20:32:53.6368790Z 2025-05-07T20:32:53.6369079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6369430Z 2025-05-07T20:32:53.6369640Z x_sign = torch.sign(x) 2025-05-07T20:32:53.6369957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:53.6370276Z x = x_sign * x_clamp 2025-05-07T20:32:53.6370532Z x0 = x[:, :D] 2025-05-07T20:32:53.6370764Z x1 = x[:, D:] 2025-05-07T20:32:53.6370980Z 2025-05-07T20:32:53.6371177Z if contiguous: 2025-05-07T20:32:53.6371610Z x0 = x0.contiguous() 2025-05-07T20:32:53.6371886Z x1 = x1.contiguous() 2025-05-07T20:32:53.6372136Z 2025-05-07T20:32:53.6372342Z if scale_ub is not None: 2025-05-07T20:32:53.6372629Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.6372975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.6373296Z ) 2025-05-07T20:32:53.6373505Z else: 2025-05-07T20:32:53.6373724Z scale_ub_tensor = None 2025-05-07T20:32:53.6373991Z 2025-05-07T20:32:53.6374240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.6374565Z op = silu_mul_quant 2025-05-07T20:32:53.6374918Z if compiled: 2025-05-07T20:32:53.6375180Z op = torch.compile(op) 2025-05-07T20:32:53.6375481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.6375771Z 2025-05-07T20:32:53.6375978Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.6376277Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.6376583Z 2025-05-07T20:32:53.6376835Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.6377192Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.6377498Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.6377822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.6378199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.6378515Z 2025-05-07T20:32:53.6378733Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.6378932Z 2025-05-07T20:32:53.6379046Z moe/activation_test.py:126: 2025-05-07T20:32:53.6379354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.6379708Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.6380052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.6380862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.6381717Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.6382279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.6382974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.6383672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.6384457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.6385224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.6385983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.6386730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.6387387Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.6388002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.6388542Z fn() 2025-05-07T20:32:53.6389059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.6389653Z self.fn.run( 
2025-05-07T20:32:53.6390126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.6390667Z kernel = self.compile( 2025-05-07T20:32:53.6391227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.6391887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.6392381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.6392616Z 2025-05-07T20:32:53.6392825Z self = 2025-05-07T20:32:53.6393908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.6395294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15925d89d0>} 2025-05-07T20:32:53.6396692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.6397711Z context = 2025-05-07T20:32:53.6398008Z 2025-05-07T20:32:53.6398179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.6398722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.6399200Z module_map=module_map) 2025-05-07T20:32:53.6399570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.6399944Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.6400229Z E ^ 2025-05-07T20:32:53.6400702Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.6401163Z 2025-05-07T20:32:53.6401581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.6402102Z 2025-05-07T20:32:53.6402214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6402641Z self=, 2025-05-07T20:32:53.6403059Z T=2048, 2025-05-07T20:32:53.6403255Z D=5120, 2025-05-07T20:32:53.6403462Z scale_ub=1200.0, 2025-05-07T20:32:53.6403700Z contiguous=True, 2025-05-07T20:32:53.6403930Z compiled=False, 2025-05-07T20:32:53.6404156Z ) 2025-05-07T20:32:55.1138082Z self = 2025-05-07T20:32:55.1138913Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.1139321Z 2025-05-07T20:32:55.1139443Z @given( 2025-05-07T20:32:55.1139794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1140880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1141296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1141646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1141986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1142290Z ) 2025-05-07T20:32:55.1142665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1143123Z def test_silu_mul_quant( 2025-05-07T20:32:55.1143381Z self, 2025-05-07T20:32:55.1143592Z T: int, 2025-05-07T20:32:55.1143798Z D: int, 2025-05-07T20:32:55.1144035Z scale_ub: Optional[float], 2025-05-07T20:32:55.1144321Z contiguous: bool, 2025-05-07T20:32:55.1144575Z compiled: bool, 2025-05-07T20:32:55.1144814Z ) -> None: 2025-05-07T20:32:55.1145048Z torch.manual_seed(2025) 2025-05-07T20:32:55.1145303Z 2025-05-07T20:32:55.1145588Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1145954Z 
2025-05-07T20:32:55.1146164Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1146464Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1146794Z x = x_sign * x_clamp 2025-05-07T20:32:55.1147054Z x0 = x[:, :D] 2025-05-07T20:32:55.1147374Z x1 = x[:, D:] 2025-05-07T20:32:55.1147705Z 2025-05-07T20:32:55.1147909Z if contiguous: 2025-05-07T20:32:55.1148148Z x0 = x0.contiguous() 2025-05-07T20:32:55.1148425Z x1 = x1.contiguous() 2025-05-07T20:32:55.1148681Z 2025-05-07T20:32:55.1148878Z if scale_ub is not None: 2025-05-07T20:32:55.1149167Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1149517Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1149839Z ) 2025-05-07T20:32:55.1150047Z else: 2025-05-07T20:32:55.1150272Z scale_ub_tensor = None 2025-05-07T20:32:55.1150539Z 2025-05-07T20:32:55.1150875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1151207Z op = silu_mul_quant 2025-05-07T20:32:55.1151470Z if compiled: 2025-05-07T20:32:55.1151726Z op = torch.compile(op) 2025-05-07T20:32:55.1152035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1152330Z 2025-05-07T20:32:55.1152531Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1152709Z 2025-05-07T20:32:55.1152816Z moe/activation_test.py:117: 2025-05-07T20:32:55.1153129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1153469Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1153772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1154514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1155241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1155792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1156486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1157164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1157706Z kernel = self.compile( 2025-05-07T20:32:55.1158268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1158937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1159351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1159588Z 2025-05-07T20:32:55.1159801Z self = 2025-05-07T20:32:55.1160958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1162352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f156fced5e0>} 2025-05-07T20:32:55.1163704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1164779Z context = 2025-05-07T20:32:55.1165073Z 2025-05-07T20:32:55.1165245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1165788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1166266Z module_map=module_map) 2025-05-07T20:32:55.1166648Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1167024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1167298Z E ^ 2025-05-07T20:32:55.1167773Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1168324Z 2025-05-07T20:32:55.1168745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1169265Z 2025-05-07T20:32:55.1169373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1169800Z self=, 2025-05-07T20:32:55.1170212Z T=2048, 2025-05-07T20:32:55.1170408Z D=5120, 2025-05-07T20:32:55.1170613Z scale_ub=1200.0, 2025-05-07T20:32:55.1170846Z contiguous=True, 2025-05-07T20:32:55.1171077Z compiled=True, 2025-05-07T20:32:55.1171298Z ) 2025-05-07T20:32:55.1171629Z self = 2025-05-07T20:32:55.1172183Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1172486Z 2025-05-07T20:32:55.1172570Z @given( 2025-05-07T20:32:55.1172812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1173141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1173462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1173807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1174142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1174441Z ) 2025-05-07T20:32:55.1174804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1175254Z def test_silu_mul_quant( 2025-05-07T20:32:55.1175509Z self, 2025-05-07T20:32:55.1175716Z T: int, 2025-05-07T20:32:55.1175920Z D: int, 2025-05-07T20:32:55.1176153Z scale_ub: Optional[float], 2025-05-07T20:32:55.1176446Z contiguous: bool, 2025-05-07T20:32:55.1176692Z compiled: bool, 2025-05-07T20:32:55.1176930Z ) -> None: 2025-05-07T20:32:55.1177164Z torch.manual_seed(2025) 2025-05-07T20:32:55.1177420Z 2025-05-07T20:32:55.1177693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1178052Z 2025-05-07T20:32:55.1178255Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1178550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1178874Z x = x_sign * x_clamp 2025-05-07T20:32:55.1179129Z x0 = x[:, :D] 2025-05-07T20:32:55.1179350Z x1 = x[:, D:] 2025-05-07T20:32:55.1179567Z 2025-05-07T20:32:55.1179763Z if contiguous: 2025-05-07T20:32:55.1179997Z x0 = x0.contiguous() 2025-05-07T20:32:55.1180266Z x1 = x1.contiguous() 2025-05-07T20:32:55.1180520Z 2025-05-07T20:32:55.1180723Z if scale_ub is not None: 2025-05-07T20:32:55.1181124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1181473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1181785Z ) 2025-05-07T20:32:55.1181993Z else: 2025-05-07T20:32:55.1182217Z scale_ub_tensor = None 2025-05-07T20:32:55.1182479Z 2025-05-07T20:32:55.1182721Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1183050Z op = silu_mul_quant 2025-05-07T20:32:55.1183314Z if compiled: 
2025-05-07T20:32:55.1183568Z op = torch.compile(op) 2025-05-07T20:32:55.1183875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1184170Z 2025-05-07T20:32:55.1184372Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.1184709Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.1185031Z 2025-05-07T20:32:55.1185271Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1185626Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.1185936Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.1186257Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.1186632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.1186960Z 2025-05-07T20:32:55.1187269Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.1187475Z 2025-05-07T20:32:55.1187580Z moe/activation_test.py:126: 2025-05-07T20:32:55.1187889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1188240Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.1188571Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.1189370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.1190132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.1190698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1191438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1192134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.1192872Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.1193623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.1194390Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.1195178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.1195821Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.1196427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.1196957Z fn() 2025-05-07T20:32:55.1197475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.1198058Z self.fn.run( 2025-05-07T20:32:55.1198532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1199079Z kernel = self.compile( 2025-05-07T20:32:55.1199630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1200285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1200691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1200933Z 2025-05-07T20:32:55.1201143Z self = 2025-05-07T20:32:55.1202295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:55.1203680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1591056430>} 2025-05-07T20:32:55.1205091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1206116Z context = 2025-05-07T20:32:55.1206409Z 2025-05-07T20:32:55.1206587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1207129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1207606Z module_map=module_map) 2025-05-07T20:32:55.1207983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1208351Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.1208624Z E ^ 2025-05-07T20:32:55.1209140Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1209667Z 2025-05-07T20:32:55.1210098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1210612Z 2025-05-07T20:32:55.1210726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1211143Z self=, 2025-05-07T20:32:55.1211558Z T=16384, 2025-05-07T20:32:55.1211764Z D=7168, 2025-05-07T20:32:55.1211962Z scale_ub=1200.0, 2025-05-07T20:32:55.1212199Z contiguous=False, 2025-05-07T20:32:55.1212488Z compiled=False, 2025-05-07T20:32:55.1212702Z ) 2025-05-07T20:32:56.4793592Z self = 2025-05-07T20:32:56.4794440Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:56.4794848Z 2025-05-07T20:32:56.4794986Z @given( 2025-05-07T20:32:56.4795304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.4795635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.4795958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.4796302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.4796644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.4796942Z ) 2025-05-07T20:32:56.4797304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.4797755Z def test_silu_mul_quant( 2025-05-07T20:32:56.4798007Z self, 2025-05-07T20:32:56.4798257Z T: int, 2025-05-07T20:32:56.4798466Z D: int, 2025-05-07T20:32:56.4798694Z scale_ub: Optional[float], 2025-05-07T20:32:56.4798979Z contiguous: bool, 2025-05-07T20:32:56.4799234Z compiled: bool, 2025-05-07T20:32:56.4799474Z ) -> None: 2025-05-07T20:32:56.4799695Z torch.manual_seed(2025) 2025-05-07T20:32:56.4799955Z 2025-05-07T20:32:56.4800239Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.4800587Z 2025-05-07T20:32:56.4800793Z x_sign = torch.sign(x) 2025-05-07T20:32:56.4801099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.4801416Z x = x_sign * x_clamp 2025-05-07T20:32:56.4801671Z x0 = x[:, :D] 2025-05-07T20:32:56.4801899Z x1 = x[:, D:] 2025-05-07T20:32:56.4802112Z 2025-05-07T20:32:56.4802311Z if contiguous: 2025-05-07T20:32:56.4802556Z x0 = x0.contiguous() 2025-05-07T20:32:56.4802823Z x1 = x1.contiguous() 2025-05-07T20:32:56.4803375Z 2025-05-07T20:32:56.4803582Z if scale_ub is not None: 2025-05-07T20:32:56.4803864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.4804218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.4804539Z ) 2025-05-07T20:32:56.4804744Z else: 2025-05-07T20:32:56.4804969Z scale_ub_tensor = None 2025-05-07T20:32:56.4805233Z 2025-05-07T20:32:56.4805479Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:56.4805800Z op = silu_mul_quant 2025-05-07T20:32:56.4806067Z if compiled: 2025-05-07T20:32:56.4806331Z op = torch.compile(op) 2025-05-07T20:32:56.4806632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4806924Z 2025-05-07T20:32:56.4807128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.4807298Z 2025-05-07T20:32:56.4807403Z moe/activation_test.py:117: 2025-05-07T20:32:56.4807718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4808065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.4808356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4809057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:56.4809933Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.4810491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.4811178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.4811850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.4812393Z kernel = self.compile( 2025-05-07T20:32:56.4812947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.4813694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.4814100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4814335Z 2025-05-07T20:32:56.4814551Z self = 2025-05-07T20:32:56.4815636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.4817020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590ffe9d0>} 2025-05-07T20:32:56.4818358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.4819382Z context = 2025-05-07T20:32:56.4819671Z 2025-05-07T20:32:56.4819852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.4820381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.4820857Z module_map=module_map) 2025-05-07T20:32:56.4821351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.4821708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.4821985Z E ^ 2025-05-07T20:32:56.4822457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.4822911Z 2025-05-07T20:32:56.4823334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.4823856Z 2025-05-07T20:32:56.4824017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.4824441Z self=, 2025-05-07T20:32:56.4824856Z T=1, 2025-05-07T20:32:56.4825053Z D=7168, 2025-05-07T20:32:56.4825253Z scale_ub=None, 2025-05-07T20:32:56.4825482Z contiguous=True, 2025-05-07T20:32:56.4825721Z compiled=True, 2025-05-07T20:32:56.4825933Z ) 2025-05-07T20:32:56.4826266Z self = 2025-05-07T20:32:56.4826759Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:56.4827020Z 2025-05-07T20:32:56.4827104Z @given( 2025-05-07T20:32:56.4827346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.4827670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.4828001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.4835491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.4835864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.4836164Z ) 2025-05-07T20:32:56.4836540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.4837000Z def test_silu_mul_quant( 2025-05-07T20:32:56.4837248Z self, 2025-05-07T20:32:56.4837537Z T: int, 2025-05-07T20:32:56.4837793Z D: int, 2025-05-07T20:32:56.4838031Z scale_ub: Optional[float], 2025-05-07T20:32:56.4838308Z contiguous: bool, 2025-05-07T20:32:56.4838562Z compiled: bool, 2025-05-07T20:32:56.4838801Z ) -> None: 2025-05-07T20:32:56.4839023Z torch.manual_seed(2025) 2025-05-07T20:32:56.4839282Z 2025-05-07T20:32:56.4839567Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.4839913Z 2025-05-07T20:32:56.4840429Z x_sign = torch.sign(x) 2025-05-07T20:32:56.4840745Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.4841072Z x = x_sign * x_clamp 2025-05-07T20:32:56.4841417Z x0 = x[:, :D] 2025-05-07T20:32:56.4841648Z x1 = x[:, D:] 2025-05-07T20:32:56.4841858Z 2025-05-07T20:32:56.4842058Z if contiguous: 2025-05-07T20:32:56.4842301Z x0 = x0.contiguous() 2025-05-07T20:32:56.4842563Z x1 = x1.contiguous() 2025-05-07T20:32:56.4842821Z 2025-05-07T20:32:56.4843027Z if scale_ub is not None: 2025-05-07T20:32:56.4843308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.4843659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.4843986Z ) 2025-05-07T20:32:56.4844194Z else: 2025-05-07T20:32:56.4844415Z scale_ub_tensor = None 2025-05-07T20:32:56.4844682Z 2025-05-07T20:32:56.4844928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.4845246Z op = silu_mul_quant 2025-05-07T20:32:56.4845508Z if compiled: 2025-05-07T20:32:56.4845771Z op = torch.compile(op) 2025-05-07T20:32:56.4846080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4846369Z 2025-05-07T20:32:56.4846574Z y_fp8, y_scale = fn() 2025-05-07T20:32:56.4846862Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:56.4847166Z 2025-05-07T20:32:56.4847419Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.4847761Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:56.4848067Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:56.4848395Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:56.4848765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.4849083Z 2025-05-07T20:32:56.4849300Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:56.4849502Z 2025-05-07T20:32:56.4849615Z moe/activation_test.py:126: 2025-05-07T20:32:56.4849925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4850354Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:56.4850703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.4851507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:56.4852276Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:56.4852839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.4853537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.4854227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:56.4854964Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.4855730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:56.4856491Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.4857217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:56.4857993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:56.4858612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:56.4859138Z fn() 2025-05-07T20:32:56.4859651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:56.4860237Z self.fn.run( 2025-05-07T20:32:56.4860717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.4861317Z kernel = self.compile( 2025-05-07T20:32:56.4861875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.4862589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.4863003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4863236Z 2025-05-07T20:32:56.4863455Z self = 2025-05-07T20:32:56.4864538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.4865936Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590ffef70>} 2025-05-07T20:32:56.4867280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.4868308Z context = 2025-05-07T20:32:56.4868606Z 2025-05-07T20:32:56.4868777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.4869323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.4869796Z module_map=module_map) 2025-05-07T20:32:56.4870168Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.4870537Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:56.4870815Z E ^ 2025-05-07T20:32:56.4871275Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.4871730Z 2025-05-07T20:32:56.4872197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.4872732Z 2025-05-07T20:32:56.4872837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.4873259Z self=, 2025-05-07T20:32:56.4873662Z T=4096, 2025-05-07T20:32:56.4873857Z D=5120, 2025-05-07T20:32:56.4874064Z scale_ub=None, 2025-05-07T20:32:56.4874282Z contiguous=False, 2025-05-07T20:32:56.4874516Z compiled=False, 2025-05-07T20:32:56.4874730Z ) 2025-05-07T20:32:58.2388059Z self = 2025-05-07T20:32:58.2388877Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2389280Z 2025-05-07T20:32:58.2389395Z @given( 2025-05-07T20:32:58.2389722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2390048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2390375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2390752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2391094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2391396Z ) 2025-05-07T20:32:58.2391765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2392654Z def test_silu_mul_quant( 2025-05-07T20:32:58.2392919Z self, 2025-05-07T20:32:58.2393133Z T: int, 2025-05-07T20:32:58.2393338Z D: int, 2025-05-07T20:32:58.2393574Z scale_ub: Optional[float], 2025-05-07T20:32:58.2393862Z contiguous: bool, 2025-05-07T20:32:58.2394112Z compiled: bool, 2025-05-07T20:32:58.2394355Z ) -> None: 2025-05-07T20:32:58.2394586Z torch.manual_seed(2025) 2025-05-07T20:32:58.2394836Z 2025-05-07T20:32:58.2395122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2395503Z 2025-05-07T20:32:58.2395727Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2396134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2396463Z x = x_sign * x_clamp 2025-05-07T20:32:58.2396719Z x0 = x[:, :D] 2025-05-07T20:32:58.2396941Z x1 = x[:, D:] 2025-05-07T20:32:58.2397159Z 2025-05-07T20:32:58.2397360Z if contiguous: 2025-05-07T20:32:58.2397604Z x0 = x0.contiguous() 2025-05-07T20:32:58.2397877Z x1 = x1.contiguous() 2025-05-07T20:32:58.2398131Z 2025-05-07T20:32:58.2398328Z if scale_ub is not None: 2025-05-07T20:32:58.2398612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2398961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2399277Z ) 2025-05-07T20:32:58.2399482Z else: 2025-05-07T20:32:58.2399705Z scale_ub_tensor = None 2025-05-07T20:32:58.2399963Z 2025-05-07T20:32:58.2400209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2400539Z op = silu_mul_quant 2025-05-07T20:32:58.2400804Z if compiled: 
2025-05-07T20:32:58.2401064Z op = torch.compile(op) 2025-05-07T20:32:58.2401376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2401667Z 2025-05-07T20:32:58.2401864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2402040Z 2025-05-07T20:32:58.2402151Z moe/activation_test.py:117: 2025-05-07T20:32:58.2402459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2402798Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2403091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2403792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2404484Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2405034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2405882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2406565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2407102Z kernel = self.compile( 2025-05-07T20:32:58.2407656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2408322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2408731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2408969Z 2025-05-07T20:32:58.2409183Z self = 2025-05-07T20:32:58.2410269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2411685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c1fca0>} 2025-05-07T20:32:58.2413077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2414133Z context = 2025-05-07T20:32:58.2414429Z 2025-05-07T20:32:58.2414604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2415138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2415613Z module_map=module_map) 2025-05-07T20:32:58.2415986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2416354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2416673Z E ^ 2025-05-07T20:32:58.2417136Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2417600Z 2025-05-07T20:32:58.2418028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2418550Z 2025-05-07T20:32:58.2418659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2419085Z self=, 2025-05-07T20:32:58.2419489Z T=4096, 2025-05-07T20:32:58.2419687Z D=7168, 2025-05-07T20:32:58.2419891Z scale_ub=None, 2025-05-07T20:32:58.2420111Z contiguous=False, 2025-05-07T20:32:58.2420345Z compiled=False, 2025-05-07T20:32:58.2420565Z ) 2025-05-07T20:32:58.2420884Z self = 2025-05-07T20:32:58.2421516Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2421798Z 2025-05-07T20:32:58.2421879Z @given( 2025-05-07T20:32:58.2422114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2422431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2422745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2423090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2423441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2423741Z ) 2025-05-07T20:32:58.2424103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2424549Z def test_silu_mul_quant( 2025-05-07T20:32:58.2424800Z self, 2025-05-07T20:32:58.2425005Z T: int, 2025-05-07T20:32:58.2425205Z D: int, 2025-05-07T20:32:58.2425457Z scale_ub: Optional[float], 2025-05-07T20:32:58.2425762Z contiguous: bool, 2025-05-07T20:32:58.2426003Z compiled: bool, 2025-05-07T20:32:58.2426295Z ) -> None: 2025-05-07T20:32:58.2426523Z torch.manual_seed(2025) 2025-05-07T20:32:58.2426768Z 2025-05-07T20:32:58.2427047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2427400Z 2025-05-07T20:32:58.2427601Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2427909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2428234Z x = x_sign * x_clamp 2025-05-07T20:32:58.2428478Z x0 = x[:, :D] 2025-05-07T20:32:58.2428703Z x1 = x[:, D:] 2025-05-07T20:32:58.2428920Z 2025-05-07T20:32:58.2429118Z if contiguous: 2025-05-07T20:32:58.2429353Z x0 = x0.contiguous() 2025-05-07T20:32:58.2429620Z x1 = x1.contiguous() 2025-05-07T20:32:58.2429872Z 2025-05-07T20:32:58.2430067Z if scale_ub is not None: 2025-05-07T20:32:58.2430352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2430700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2431024Z ) 2025-05-07T20:32:58.2431229Z else: 2025-05-07T20:32:58.2431450Z scale_ub_tensor = None 2025-05-07T20:32:58.2431704Z 2025-05-07T20:32:58.2431944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2432265Z op = silu_mul_quant 2025-05-07T20:32:58.2432616Z if compiled: 2025-05-07T20:32:58.2432878Z op = torch.compile(op) 2025-05-07T20:32:58.2433183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2433462Z 2025-05-07T20:32:58.2433666Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2433842Z 2025-05-07T20:32:58.2433945Z moe/activation_test.py:117: 2025-05-07T20:32:58.2434252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2434591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2434884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2435588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2436334Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2436880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2437574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2438250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2438785Z kernel = self.compile( 2025-05-07T20:32:58.2439333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2439998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2440728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2440975Z 2025-05-07T20:32:58.2441193Z self = 2025-05-07T20:32:58.2442278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2443650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c7a700>} 2025-05-07T20:32:58.2444994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2446008Z context = 2025-05-07T20:32:58.2446306Z 2025-05-07T20:32:58.2446477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2447089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2447574Z module_map=module_map) 2025-05-07T20:32:58.2447948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2448313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2448604Z E ^ 2025-05-07T20:32:58.2449066Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2449520Z 2025-05-07T20:32:58.2449947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2450467Z 2025-05-07T20:32:58.2450574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2450997Z self=, 2025-05-07T20:32:58.2451399Z T=128, 2025-05-07T20:32:58.2451601Z D=7168, 2025-05-07T20:32:58.2451809Z scale_ub=None, 2025-05-07T20:32:58.2452027Z contiguous=False, 2025-05-07T20:32:58.2452266Z compiled=True, 2025-05-07T20:32:58.2452478Z ) 2025-05-07T20:32:58.3209831Z self = 2025-05-07T20:32:58.3210893Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.3211343Z 2025-05-07T20:32:58.3211431Z @given( 2025-05-07T20:32:58.3211679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3212006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3212331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3212671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3213020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3213318Z ) 2025-05-07T20:32:58.3213673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3214132Z def test_silu_mul_quant( 2025-05-07T20:32:58.3214523Z self, 2025-05-07T20:32:58.3214725Z T: int, 2025-05-07T20:32:58.3214938Z D: int, 2025-05-07T20:32:58.3215172Z scale_ub: Optional[float], 2025-05-07T20:32:58.3215452Z contiguous: bool, 2025-05-07T20:32:58.3215706Z compiled: bool, 2025-05-07T20:32:58.3215952Z ) -> None: 2025-05-07T20:32:58.3216174Z torch.manual_seed(2025) 2025-05-07T20:32:58.3216434Z 2025-05-07T20:32:58.3216720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3217073Z 2025-05-07T20:32:58.3217272Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3217577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3217900Z x = x_sign * x_clamp 2025-05-07T20:32:58.3218147Z x0 = x[:, :D] 2025-05-07T20:32:58.3218378Z x1 = x[:, D:] 2025-05-07T20:32:58.3218602Z 2025-05-07T20:32:58.3218794Z if contiguous: 2025-05-07T20:32:58.3219049Z x0 = x0.contiguous() 2025-05-07T20:32:58.3219323Z x1 = x1.contiguous() 2025-05-07T20:32:58.3219571Z 2025-05-07T20:32:58.3219778Z if scale_ub is not None: 2025-05-07T20:32:58.3220066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3220414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3220744Z ) 2025-05-07T20:32:58.3220952Z else: 2025-05-07T20:32:58.3221277Z scale_ub_tensor = None 2025-05-07T20:32:58.3221546Z 2025-05-07T20:32:58.3221794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3222125Z op = silu_mul_quant 2025-05-07T20:32:58.3222385Z if compiled: 2025-05-07T20:32:58.3222645Z op = torch.compile(op) 2025-05-07T20:32:58.3222961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3223243Z 2025-05-07T20:32:58.3223451Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.3223837Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.3224142Z 2025-05-07T20:32:58.3224394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3224746Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.3225048Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.3225379Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.3225750Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3226072Z 2025-05-07T20:32:58.3226278Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.3226485Z 2025-05-07T20:32:58.3226590Z moe/activation_test.py:126: 2025-05-07T20:32:58.3226902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3227242Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.3227586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3228395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.3229154Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.3229720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3230517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3231228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.3231953Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3232716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.3233470Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3234205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.3234897Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.3235517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.3236056Z fn() 2025-05-07T20:32:58.3236569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.3237157Z self.fn.run( 2025-05-07T20:32:58.3237634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3238172Z kernel = self.compile( 2025-05-07T20:32:58.3238716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3239377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3239789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3240030Z 2025-05-07T20:32:58.3240532Z self = 2025-05-07T20:32:58.3241621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3243011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590a2f5e0>} 2025-05-07T20:32:58.3244355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3245390Z context = 2025-05-07T20:32:58.3245688Z 2025-05-07T20:32:58.3245936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3246483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3246964Z module_map=module_map) 2025-05-07T20:32:58.3247350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3247716Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.3247999Z E ^ 2025-05-07T20:32:58.3248478Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3248932Z 2025-05-07T20:32:58.3249358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3249881Z 2025-05-07T20:32:58.3249990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3250421Z self=, 2025-05-07T20:32:58.3250841Z T=128, 2025-05-07T20:32:58.3251038Z D=7168, 2025-05-07T20:32:58.3251244Z scale_ub=None, 2025-05-07T20:32:58.3251474Z contiguous=False, 2025-05-07T20:32:58.3251709Z compiled=False, 2025-05-07T20:32:58.3251927Z ) 2025-05-07T20:32:58.7325873Z self = 2025-05-07T20:32:58.7326764Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.7327044Z 2025-05-07T20:32:58.7327137Z @given( 2025-05-07T20:32:58.7327383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.7327709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.7328029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.7328375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.7328710Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.7329011Z ) 2025-05-07T20:32:58.7329376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.7337576Z def test_silu_mul_quant( 2025-05-07T20:32:58.7337860Z self, 2025-05-07T20:32:58.7338076Z T: int, 2025-05-07T20:32:58.7338284Z D: int, 2025-05-07T20:32:58.7338520Z scale_ub: Optional[float], 2025-05-07T20:32:58.7338826Z contiguous: bool, 2025-05-07T20:32:58.7339075Z compiled: bool, 2025-05-07T20:32:58.7339329Z ) -> None: 2025-05-07T20:32:58.7339564Z torch.manual_seed(2025) 2025-05-07T20:32:58.7339814Z 2025-05-07T20:32:58.7340377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.7340744Z 2025-05-07T20:32:58.7340945Z x_sign = torch.sign(x) 2025-05-07T20:32:58.7341354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.7341679Z x = x_sign * x_clamp 2025-05-07T20:32:58.7341927Z x0 = x[:, :D] 2025-05-07T20:32:58.7342160Z x1 = x[:, D:] 2025-05-07T20:32:58.7342392Z 2025-05-07T20:32:58.7342585Z if contiguous: 2025-05-07T20:32:58.7342836Z x0 = x0.contiguous() 2025-05-07T20:32:58.7343110Z x1 = x1.contiguous() 2025-05-07T20:32:58.7343365Z 2025-05-07T20:32:58.7343569Z if scale_ub is not None: 2025-05-07T20:32:58.7343859Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.7344214Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.7344530Z ) 2025-05-07T20:32:58.7344738Z else: 2025-05-07T20:32:58.7344962Z scale_ub_tensor = None 2025-05-07T20:32:58.7345221Z 2025-05-07T20:32:58.7345475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.7345855Z op = silu_mul_quant 2025-05-07T20:32:58.7346116Z if compiled: 
2025-05-07T20:32:58.7346383Z op = torch.compile(op) 2025-05-07T20:32:58.7346696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7346980Z 2025-05-07T20:32:58.7347328Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.7347502Z 2025-05-07T20:32:58.7347619Z moe/activation_test.py:117: 2025-05-07T20:32:58.7347931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7348269Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.7348573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7349289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.7349984Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.7350537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.7351227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.7351900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.7352442Z kernel = self.compile( 2025-05-07T20:32:58.7353005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.7353671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.7354143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7354444Z 2025-05-07T20:32:58.7354657Z self = 2025-05-07T20:32:58.7355750Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.7357141Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15906d2ee0>} 2025-05-07T20:32:58.7358567Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.7359578Z context = 2025-05-07T20:32:58.7359878Z 2025-05-07T20:32:58.7360053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.7360595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.7361076Z module_map=module_map) 2025-05-07T20:32:58.7361451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.7361817Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.7362090Z E ^ 2025-05-07T20:32:58.7362554Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7363024Z 2025-05-07T20:32:58.7363445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7363965Z 2025-05-07T20:32:58.7364073Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7364496Z self=, 2025-05-07T20:32:58.7364905Z T=4096, 2025-05-07T20:32:58.7365109Z D=5120, 2025-05-07T20:32:58.7365314Z scale_ub=1200.0, 2025-05-07T20:32:58.7365546Z contiguous=True, 2025-05-07T20:32:58.7365809Z compiled=False, 2025-05-07T20:32:58.7366048Z ) 2025-05-07T20:32:58.7366369Z self = 2025-05-07T20:32:58.7366875Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.7367151Z 2025-05-07T20:32:58.7367243Z @given( 2025-05-07T20:32:58.7367488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.7367857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.7368186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.7368536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.7368876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.7369181Z ) 2025-05-07T20:32:58.7369559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.7370006Z def test_silu_mul_quant( 2025-05-07T20:32:58.7370261Z self, 2025-05-07T20:32:58.7370470Z T: int, 2025-05-07T20:32:58.7370678Z D: int, 2025-05-07T20:32:58.7370909Z scale_ub: Optional[float], 2025-05-07T20:32:58.7371194Z contiguous: bool, 2025-05-07T20:32:58.7371436Z compiled: bool, 2025-05-07T20:32:58.7371676Z ) -> None: 2025-05-07T20:32:58.7371906Z torch.manual_seed(2025) 2025-05-07T20:32:58.7372152Z 2025-05-07T20:32:58.7372437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.7372804Z 2025-05-07T20:32:58.7373007Z x_sign = torch.sign(x) 2025-05-07T20:32:58.7373302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.7373628Z x = x_sign * x_clamp 2025-05-07T20:32:58.7373888Z x0 = x[:, :D] 2025-05-07T20:32:58.7374111Z x1 = x[:, D:] 2025-05-07T20:32:58.7374425Z 2025-05-07T20:32:58.7374627Z if contiguous: 2025-05-07T20:32:58.7374866Z x0 = x0.contiguous() 2025-05-07T20:32:58.7375133Z x1 = x1.contiguous() 2025-05-07T20:32:58.7375379Z 2025-05-07T20:32:58.7375597Z if scale_ub is not None: 2025-05-07T20:32:58.7375911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.7376257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.7376575Z ) 2025-05-07T20:32:58.7376780Z else: 2025-05-07T20:32:58.7377005Z scale_ub_tensor = None 2025-05-07T20:32:58.7377263Z 2025-05-07T20:32:58.7377510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.7377889Z op = silu_mul_quant 2025-05-07T20:32:58.7378151Z if compiled: 2025-05-07T20:32:58.7378405Z op = torch.compile(op) 2025-05-07T20:32:58.7378714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7378995Z 2025-05-07T20:32:58.7379200Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.7379368Z 2025-05-07T20:32:58.7379476Z moe/activation_test.py:117: 2025-05-07T20:32:58.7379773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7380114Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.7380405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7381193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.7381881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.7382427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.7383129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.7383788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.7384343Z kernel = self.compile( 2025-05-07T20:32:58.7384892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.7385559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.7385966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7386206Z 2025-05-07T20:32:58.7386420Z self = 2025-05-07T20:32:58.7387549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.7388927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15907a9670>} 2025-05-07T20:32:58.7390259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.7391289Z context = 2025-05-07T20:32:58.7391585Z 2025-05-07T20:32:58.7391758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.7392288Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.7392753Z module_map=module_map) 2025-05-07T20:32:58.7393136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.7393500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.7393770Z E ^ 2025-05-07T20:32:58.7394229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7394742Z 2025-05-07T20:32:58.7395204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7395724Z 2025-05-07T20:32:58.7395842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7396256Z self=, 2025-05-07T20:32:58.7396665Z T=1, 2025-05-07T20:32:58.7396861Z D=5120, 2025-05-07T20:32:58.7397062Z scale_ub=None, 2025-05-07T20:32:58.7397278Z contiguous=True, 2025-05-07T20:32:58.7397509Z compiled=True, 2025-05-07T20:32:58.7397720Z ) 2025-05-07T20:32:59.3873700Z self = 2025-05-07T20:32:59.3874752Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:59.3875105Z 2025-05-07T20:32:59.3875217Z @given( 2025-05-07T20:32:59.3875522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3875956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3876292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3876638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3876987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3877286Z ) 2025-05-07T20:32:59.3877643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3878099Z def test_silu_mul_quant( 2025-05-07T20:32:59.3878355Z self, 2025-05-07T20:32:59.3878558Z T: int, 2025-05-07T20:32:59.3878769Z D: int, 2025-05-07T20:32:59.3878999Z scale_ub: Optional[float], 2025-05-07T20:32:59.3879291Z contiguous: bool, 2025-05-07T20:32:59.3879537Z compiled: bool, 2025-05-07T20:32:59.3879783Z ) -> None: 2025-05-07T20:32:59.3880010Z torch.manual_seed(2025) 2025-05-07T20:32:59.3880257Z 2025-05-07T20:32:59.3880538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3880898Z 2025-05-07T20:32:59.3881098Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3881399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3881720Z x = x_sign * x_clamp 2025-05-07T20:32:59.3881968Z x0 = x[:, :D] 2025-05-07T20:32:59.3882195Z x1 = x[:, D:] 2025-05-07T20:32:59.3882411Z 2025-05-07T20:32:59.3882603Z if contiguous: 2025-05-07T20:32:59.3882849Z x0 = x0.contiguous() 2025-05-07T20:32:59.3883121Z x1 = x1.contiguous() 2025-05-07T20:32:59.3883392Z 2025-05-07T20:32:59.3883596Z if scale_ub is not None: 2025-05-07T20:32:59.3883881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3884364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3884689Z ) 2025-05-07T20:32:59.3884894Z else: 2025-05-07T20:32:59.3885110Z scale_ub_tensor = None 2025-05-07T20:32:59.3885373Z 2025-05-07T20:32:59.3885618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3885942Z op = silu_mul_quant 2025-05-07T20:32:59.3886206Z if compiled: 2025-05-07T20:32:59.3886465Z op = torch.compile(op) 2025-05-07T20:32:59.3886777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3887056Z 2025-05-07T20:32:59.3887262Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.3887561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.3887859Z 2025-05-07T20:32:59.3888108Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3888453Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.3888763Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.3889090Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.3889465Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3889778Z 2025-05-07T20:32:59.3889994Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.3890423Z 2025-05-07T20:32:59.3890533Z moe/activation_test.py:126: 2025-05-07T20:32:59.3890843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3891183Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.3891526Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3892325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.3893082Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.3893642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3894389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3895088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.3895849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3896625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:59.3897382Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3898112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.3898751Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.3899368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.3899898Z fn() 2025-05-07T20:32:59.3900407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.3901134Z self.fn.run( 2025-05-07T20:32:59.3901612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3902153Z kernel = self.compile( 2025-05-07T20:32:59.3902697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3903367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3903778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3904012Z 2025-05-07T20:32:59.3904224Z self = 2025-05-07T20:32:59.3905367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3906820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f15903f6550>} 2025-05-07T20:32:59.3908178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3909202Z context = 2025-05-07T20:32:59.3909497Z 2025-05-07T20:32:59.3909670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3910207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3910696Z module_map=module_map) 2025-05-07T20:32:59.3911074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3911439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.3911723Z E ^ 2025-05-07T20:32:59.3912244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3912741Z 2025-05-07T20:32:59.3913160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3913688Z 2025-05-07T20:32:59.3913797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3914219Z self=, 2025-05-07T20:32:59.3914631Z T=2048, 2025-05-07T20:32:59.3914826Z D=5120, 2025-05-07T20:32:59.3915031Z scale_ub=None, 2025-05-07T20:32:59.3915257Z contiguous=True, 2025-05-07T20:32:59.3915484Z compiled=True, 2025-05-07T20:32:59.3915754Z ) 2025-05-07T20:33:00.0038868Z self = 2025-05-07T20:33:00.0040752Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0041315Z 2025-05-07T20:33:00.0041494Z @given( 2025-05-07T20:33:00.0042015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0042653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0043291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0043981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0044648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0045233Z ) 2025-05-07T20:33:00.0045919Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0046427Z def test_silu_mul_quant( 2025-05-07T20:33:00.0046682Z self, 2025-05-07T20:33:00.0046888Z T: int, 2025-05-07T20:33:00.0047099Z D: int, 2025-05-07T20:33:00.0047339Z scale_ub: Optional[float], 2025-05-07T20:33:00.0047628Z contiguous: bool, 2025-05-07T20:33:00.0047875Z compiled: bool, 2025-05-07T20:33:00.0048117Z ) -> None: 2025-05-07T20:33:00.0048348Z torch.manual_seed(2025) 2025-05-07T20:33:00.0048604Z 2025-05-07T20:33:00.0048893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0049252Z 2025-05-07T20:33:00.0049456Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0049752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0050078Z x = x_sign * x_clamp 2025-05-07T20:33:00.0050333Z x0 = x[:, :D] 2025-05-07T20:33:00.0050560Z x1 = x[:, D:] 2025-05-07T20:33:00.0050778Z 2025-05-07T20:33:00.0050974Z if contiguous: 2025-05-07T20:33:00.0051213Z x0 = x0.contiguous() 2025-05-07T20:33:00.0051490Z x1 = x1.contiguous() 2025-05-07T20:33:00.0051745Z 2025-05-07T20:33:00.0052260Z if scale_ub is not None: 2025-05-07T20:33:00.0052554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0052911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0053229Z ) 2025-05-07T20:33:00.0053436Z else: 2025-05-07T20:33:00.0053665Z scale_ub_tensor = None 2025-05-07T20:33:00.0053926Z 2025-05-07T20:33:00.0054178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0054511Z op = silu_mul_quant 2025-05-07T20:33:00.0054779Z if compiled: 
2025-05-07T20:33:00.0055038Z op = torch.compile(op) 2025-05-07T20:33:00.0055350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0055643Z 2025-05-07T20:33:00.0055839Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0056146Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0056452Z 2025-05-07T20:33:00.0056694Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0057048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0057355Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0057675Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0058051Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0058546Z 2025-05-07T20:33:00.0058764Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.0058965Z 2025-05-07T20:33:00.0059072Z moe/activation_test.py:126: 2025-05-07T20:33:00.0059380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0059731Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0060070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0060873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0061714Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0062356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0063046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0063746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0064482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0065236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.0066003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0066741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0067390Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0068000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0068529Z fn() 2025-05-07T20:33:00.0069056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0069652Z self.fn.run( 2025-05-07T20:33:00.0070128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0070669Z kernel = self.compile( 2025-05-07T20:33:00.0071229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0071887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0072298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0072541Z 2025-05-07T20:33:00.0072802Z self = 2025-05-07T20:33:00.0073894Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:00.0075310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590015f70>} 2025-05-07T20:33:00.0076676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0077701Z context = 2025-05-07T20:33:00.0077994Z 2025-05-07T20:33:00.0078179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0078720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0079198Z module_map=module_map) 2025-05-07T20:33:00.0079580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0079950Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0080315Z E ^ 2025-05-07T20:33:00.0080789Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0081243Z 2025-05-07T20:33:00.0081671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn (moe/activation_test.py:126, via triton_quantize_fp8_row) with the identical CompilationError in _kernel_quantize_fp8_row; the repeated test source and traceback match the trace above.
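All of these failures share one root cause: Triton refuses to lower fp8e4nv (FP8 E4M3) on this GPU. The job runs on linux.g5.4xlarge, whose NVIDIA A10G reports compute capability 8.6, and Triton's fp8e4nv codegen is only available on SM 8.9+ parts (L4/L40S/H100 class); on this device only fp8e4b15 and fp8e5 are usable, exactly as the ValueError says. A minimal sketch of a capability guard, assuming the SM 8.9 threshold implied by the error message (the helper name is hypothetical, not FBGEMM API):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton only emits fp8e4nv (E4M3) on SM 8.9+.
    # The A10G on this runner is SM 8.6, hence the CompilationError in
    # _kernel_quantize_fp8_row above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)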
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): same CompilationError in _kernel_quantize_fp8_row, again raised from ref_fn.

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

2025-05-07T20:33:01.8975028Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.8977203Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.8978530Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.8979526Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.8980642Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
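The recompile-limit warning is a separate issue caused by the test sweeping contiguous over both True and False: x0 = x[:, :D] is a view that keeps the parent row stride 2*D (10240 when D=5120), while x0.contiguous() reallocates with row stride D (5120). torch.compile guards on strides, so every layout flip recompiles silu_mul_quant until config.recompile_limit (8) is reached. A short sketch that reproduces the stride mismatch quoted in the warning (plain PyTorch, no GPU or FBGEMM required):

import torch

D = 5120
x = torch.randn([128, 2 * D], dtype=torch.bfloat16)

x0_view = x[:, :D]              # slice of x: keeps the parent row stride 2*D
x0_cont = x0_view.contiguous()  # fresh buffer: row stride becomes D

print(x0_view.stride())  # (10240, 1): the "actual 10240" from the warning
print(x0_cont.stride())  # (5120, 1):  the "expected 5120"

As the warning itself notes, TORCH_LOGS="recompiles" lists every recompilation reason; raising torch._dynamo.config.recompile_limit would also quiet it, but neither addresses the fp8e4nv failure.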
The T=16384 example then fails the same way: CompilationError in _kernel_quantize_fp8_row, raised from ref_fn at moe/activation_test.py:126.

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): this example fails at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), inside the op under test itself:

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): here execution reaches ref_fn, and the same CompilationError is raised from the reference path again (_kernel_quantize_fp8_row via triton_quantize_fp8_row).
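For context on what the test checks: silu_mul_quant fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, and ref_fn recomputes the same value unfused before calling triton_quantize_fp8_row. A rough eager-mode illustration of row-wise FP8 quantization, assuming the conventional per-row scheme (scale = row amax / FP8 max, optionally capped by scale_ub); this is a sketch for intuition only, not FBGEMM's kernel, and FP8_E4M3_MAX is an assumed constant:

import torch

FP8_E4M3_MAX = 448.0  # assumed: largest finite magnitude representable in e4m3

def rowwise_fp8_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
    # One scale per row, chosen so the row's largest |value| maps to FP8 max.
    row_amax = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Under these assumptions the round trip y_fp8.to(torch.float32) * y_scale[:, None] approximately recovers y, which is exactly the dequantization the test performs after calling fn().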
= torch.compile(op) 2025-05-07T20:33:02.6409073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6409354Z 2025-05-07T20:33:02.6409560Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6409731Z 2025-05-07T20:33:02.6409844Z moe/activation_test.py:117: 2025-05-07T20:33:02.6410152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6410500Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6410797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6411496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.6412318Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6412879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6413570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6414233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6414784Z kernel = self.compile( 2025-05-07T20:33:02.6415336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6416002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6416470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6416713Z 2025-05-07T20:33:02.6416927Z self = 2025-05-07T20:33:02.6418013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6419385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158fdb0790>} 2025-05-07T20:33:02.6420719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6421843Z context = 2025-05-07T20:33:02.6422141Z 2025-05-07T20:33:02.6422314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6422851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6423324Z module_map=module_map) 2025-05-07T20:33:02.6423709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6424080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6424355Z E ^ 2025-05-07T20:33:02.6424818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6425273Z 2025-05-07T20:33:02.6425690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6426211Z 2025-05-07T20:33:02.6426376Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6426795Z self=, 2025-05-07T20:33:02.6427211Z T=128, 2025-05-07T20:33:02.6427411Z D=5120, 2025-05-07T20:33:02.6427614Z scale_ub=None, 2025-05-07T20:33:02.6427834Z contiguous=False, 2025-05-07T20:33:02.6428073Z compiled=True, 2025-05-07T20:33:02.6428291Z ) 2025-05-07T20:33:02.6428623Z self = 2025-05-07T20:33:02.6429125Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:02.6429397Z 2025-05-07T20:33:02.6429484Z @given( 2025-05-07T20:33:02.6429721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.6430045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.6430362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.6430699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.6431053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.6431350Z ) 2025-05-07T20:33:02.6431702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.6432154Z def test_silu_mul_quant( 2025-05-07T20:33:02.6432405Z self, 2025-05-07T20:33:02.6432603Z T: int, 2025-05-07T20:33:02.6432927Z D: int, 2025-05-07T20:33:02.6433156Z scale_ub: Optional[float], 2025-05-07T20:33:02.6433432Z contiguous: bool, 2025-05-07T20:33:02.6433682Z compiled: bool, 2025-05-07T20:33:02.6433916Z ) -> None: 2025-05-07T20:33:02.6434145Z torch.manual_seed(2025) 2025-05-07T20:33:02.6434395Z 2025-05-07T20:33:02.6434671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.6435021Z 2025-05-07T20:33:02.6435215Z x_sign = torch.sign(x) 2025-05-07T20:33:02.6435514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.6435837Z x = x_sign * x_clamp 2025-05-07T20:33:02.6436129Z x0 = x[:, :D] 2025-05-07T20:33:02.6436355Z x1 = x[:, D:] 2025-05-07T20:33:02.6436568Z 2025-05-07T20:33:02.6436753Z if contiguous: 2025-05-07T20:33:02.6436992Z x0 = x0.contiguous() 2025-05-07T20:33:02.6437263Z x1 = x1.contiguous() 2025-05-07T20:33:02.6437508Z 2025-05-07T20:33:02.6437713Z if scale_ub is not None: 2025-05-07T20:33:02.6437994Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.6438338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.6438659Z ) 2025-05-07T20:33:02.6438861Z else: 2025-05-07T20:33:02.6439073Z scale_ub_tensor = None 2025-05-07T20:33:02.6439334Z 2025-05-07T20:33:02.6439576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.6439902Z op = silu_mul_quant 2025-05-07T20:33:02.6440411Z if compiled: 2025-05-07T20:33:02.6440669Z op = torch.compile(op) 2025-05-07T20:33:02.6440979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6441257Z 2025-05-07T20:33:02.6441460Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6441629Z 2025-05-07T20:33:02.6441740Z moe/activation_test.py:117: 2025-05-07T20:33:02.6442039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6442384Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6442676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6443244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.6443818Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.6444486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.6445189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6445808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6446503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6447171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6447715Z kernel = self.compile( 2025-05-07T20:33:02.6448268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6448937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6449347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6449581Z 2025-05-07T20:33:02.6449794Z self = 2025-05-07T20:33:02.6450887Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6452264Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51040>} 2025-05-07T20:33:02.6453668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6454758Z context = 2025-05-07T20:33:02.6455051Z 2025-05-07T20:33:02.6455222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6455756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6456230Z module_map=module_map) 2025-05-07T20:33:02.6456615Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6457038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6457307Z E ^ 2025-05-07T20:33:02.6457777Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6458224Z 2025-05-07T20:33:02.6458648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6459173Z 2025-05-07T20:33:02.6459281Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6459708Z self=, 2025-05-07T20:33:02.6460117Z T=128, 2025-05-07T20:33:02.6460307Z D=7168, 2025-05-07T20:33:02.6460507Z scale_ub=1200.0, 2025-05-07T20:33:02.6460741Z contiguous=False, 2025-05-07T20:33:02.6460973Z compiled=False, 2025-05-07T20:33:02.6461254Z ) 2025-05-07T20:33:02.7992891Z self = 2025-05-07T20:33:02.7993681Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.7994082Z 2025-05-07T20:33:02.7994225Z @given( 2025-05-07T20:33:02.7994566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.7995027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.7995475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.7995847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.7996197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.7996500Z ) 2025-05-07T20:33:02.7996857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.7997317Z def test_silu_mul_quant( 2025-05-07T20:33:02.7997570Z self, 2025-05-07T20:33:02.7997775Z T: int, 2025-05-07T20:33:02.7997989Z D: int, 2025-05-07T20:33:02.7998226Z scale_ub: Optional[float], 2025-05-07T20:33:02.7998506Z contiguous: bool, 2025-05-07T20:33:02.7998878Z compiled: bool, 2025-05-07T20:33:02.7999119Z ) -> None: 2025-05-07T20:33:02.7999341Z torch.manual_seed(2025) 2025-05-07T20:33:02.7999594Z 2025-05-07T20:33:02.7999879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8000237Z 2025-05-07T20:33:02.8000447Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8000757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8001073Z x = x_sign * x_clamp 2025-05-07T20:33:02.8001326Z x0 = x[:, :D] 2025-05-07T20:33:02.8001556Z x1 = x[:, D:] 2025-05-07T20:33:02.8001770Z 2025-05-07T20:33:02.8002016Z if contiguous: 2025-05-07T20:33:02.8002260Z x0 = x0.contiguous() 2025-05-07T20:33:02.8002534Z x1 = x1.contiguous() 2025-05-07T20:33:02.8002785Z 2025-05-07T20:33:02.8002978Z if scale_ub is not None: 2025-05-07T20:33:02.8003269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8003636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8003959Z ) 2025-05-07T20:33:02.8004158Z else: 2025-05-07T20:33:02.8004387Z scale_ub_tensor = None 2025-05-07T20:33:02.8004646Z 2025-05-07T20:33:02.8004881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8005334Z op = silu_mul_quant 2025-05-07T20:33:02.8005602Z if compiled: 2025-05-07T20:33:02.8005853Z op = torch.compile(op) 2025-05-07T20:33:02.8006158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8006438Z 2025-05-07T20:33:02.8006633Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8006805Z 2025-05-07T20:33:02.8006909Z moe/activation_test.py:117: 2025-05-07T20:33:02.8007212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8007550Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8007840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8008608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8009305Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8009846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8010538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8011205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8011744Z kernel = self.compile( 2025-05-07T20:33:02.8012287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8012950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8013363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8013602Z 2025-05-07T20:33:02.8013812Z self = 2025-05-07T20:33:02.8014910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8016282Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51ca0>} 2025-05-07T20:33:02.8017623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8018644Z context = 2025-05-07T20:33:02.8018934Z 2025-05-07T20:33:02.8019157Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8019704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8020175Z module_map=module_map) 2025-05-07T20:33:02.8020549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8020920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8021270Z E ^ 2025-05-07T20:33:02.8021751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8022200Z 2025-05-07T20:33:02.8022623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8023142Z 2025-05-07T20:33:02.8023252Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8023675Z self=, 2025-05-07T20:33:02.8024088Z T=128, 2025-05-07T20:33:02.8024281Z D=5120, 2025-05-07T20:33:02.8024480Z scale_ub=None, 2025-05-07T20:33:02.8024703Z contiguous=False, 2025-05-07T20:33:02.8024936Z compiled=False, 2025-05-07T20:33:02.8025151Z ) 2025-05-07T20:33:02.8025476Z self = 2025-05-07T20:33:02.8026056Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.8026341Z 2025-05-07T20:33:02.8026425Z @given( 2025-05-07T20:33:02.8026666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.8026982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.8027297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.8027633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.8027972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.8028261Z ) 2025-05-07T20:33:02.8028620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.8029115Z def test_silu_mul_quant( 2025-05-07T20:33:02.8029362Z self, 2025-05-07T20:33:02.8029572Z T: int, 2025-05-07T20:33:02.8029783Z D: int, 2025-05-07T20:33:02.8030004Z scale_ub: Optional[float], 2025-05-07T20:33:02.8030282Z contiguous: bool, 2025-05-07T20:33:02.8030535Z compiled: bool, 2025-05-07T20:33:02.8030761Z ) -> None: 2025-05-07T20:33:02.8030981Z torch.manual_seed(2025) 2025-05-07T20:33:02.8037790Z 2025-05-07T20:33:02.8038101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8038457Z 2025-05-07T20:33:02.8038661Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8038961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8039285Z x = x_sign * x_clamp 2025-05-07T20:33:02.8039539Z x0 = x[:, :D] 2025-05-07T20:33:02.8039763Z x1 = x[:, D:] 2025-05-07T20:33:02.8039970Z 2025-05-07T20:33:02.8040424Z if contiguous: 2025-05-07T20:33:02.8040673Z x0 = x0.contiguous() 2025-05-07T20:33:02.8040935Z x1 = x1.contiguous() 2025-05-07T20:33:02.8041186Z 2025-05-07T20:33:02.8041389Z if scale_ub is not None: 2025-05-07T20:33:02.8041667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8042020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8042338Z ) 2025-05-07T20:33:02.8042532Z else: 2025-05-07T20:33:02.8042752Z scale_ub_tensor = None 2025-05-07T20:33:02.8043019Z 2025-05-07T20:33:02.8043256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8043585Z op = silu_mul_quant 2025-05-07T20:33:02.8043846Z if compiled: 2025-05-07T20:33:02.8044103Z op = torch.compile(op) 2025-05-07T20:33:02.8044407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8044696Z 2025-05-07T20:33:02.8045011Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8045186Z 2025-05-07T20:33:02.8045290Z moe/activation_test.py:117: 2025-05-07T20:33:02.8045596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8045940Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8046224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8046928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8047622Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8048167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8048852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8049529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8050067Z kernel = self.compile( 2025-05-07T20:33:02.8050621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8051288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8051688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8052041Z 2025-05-07T20:33:02.8052260Z self = 2025-05-07T20:33:02.8053344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8054714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f3fd310>} 2025-05-07T20:33:02.8056058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8057149Z context = 2025-05-07T20:33:02.8057440Z 2025-05-07T20:33:02.8057624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8058152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8058623Z module_map=module_map) 2025-05-07T20:33:02.8059002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8059359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8059630Z E ^ 2025-05-07T20:33:02.8060098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8060552Z 2025-05-07T20:33:02.8060979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8061570Z 2025-05-07T20:33:02.8061679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8062096Z self=, 2025-05-07T20:33:02.8062508Z T=128, 2025-05-07T20:33:02.8062702Z D=5120, 2025-05-07T20:33:02.8062908Z scale_ub=1200.0, 2025-05-07T20:33:02.8063139Z contiguous=True, 2025-05-07T20:33:02.8063365Z compiled=False, 2025-05-07T20:33:02.8063585Z ) 2025-05-07T20:33:03.0361973Z self = 2025-05-07T20:33:03.0363493Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.0364273Z 2025-05-07T20:33:03.0364497Z @given( 2025-05-07T20:33:03.0364983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0365620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0366587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0367059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0367397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0367692Z ) 2025-05-07T20:33:03.0368069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0368524Z def test_silu_mul_quant( 2025-05-07T20:33:03.0368778Z self, 2025-05-07T20:33:03.0368990Z T: int, 2025-05-07T20:33:03.0369198Z D: int, 2025-05-07T20:33:03.0369435Z scale_ub: Optional[float], 2025-05-07T20:33:03.0369726Z contiguous: bool, 2025-05-07T20:33:03.0369982Z compiled: bool, 2025-05-07T20:33:03.0370213Z ) -> None: 2025-05-07T20:33:03.0370448Z torch.manual_seed(2025) 2025-05-07T20:33:03.0370710Z 2025-05-07T20:33:03.0370993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0371346Z 2025-05-07T20:33:03.0371567Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0371870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0372198Z x = x_sign * x_clamp 2025-05-07T20:33:03.0372454Z x0 = x[:, :D] 2025-05-07T20:33:03.0372678Z x1 = x[:, D:] 2025-05-07T20:33:03.0372962Z 2025-05-07T20:33:03.0373211Z if contiguous: 2025-05-07T20:33:03.0373454Z x0 = x0.contiguous() 2025-05-07T20:33:03.0373729Z x1 = x1.contiguous() 2025-05-07T20:33:03.0373980Z 2025-05-07T20:33:03.0374180Z if scale_ub is not None: 2025-05-07T20:33:03.0374471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0374832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0375161Z ) 2025-05-07T20:33:03.0375360Z else: 2025-05-07T20:33:03.0375587Z scale_ub_tensor = None 2025-05-07T20:33:03.0375851Z 2025-05-07T20:33:03.0376091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0376495Z op = silu_mul_quant 2025-05-07T20:33:03.0376768Z if compiled: 2025-05-07T20:33:03.0377024Z op = torch.compile(op) 2025-05-07T20:33:03.0377339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0377633Z 2025-05-07T20:33:03.0377833Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0378006Z 2025-05-07T20:33:03.0378112Z moe/activation_test.py:117: 2025-05-07T20:33:03.0378417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0378757Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0379055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0379757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0380447Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.0381059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.0381751Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.0382418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.0382961Z     kernel = self.compile(
2025-05-07T20:33:03.0383519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.0384184Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.0384585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.0390379Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.0390914Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.0391377Z                            module_map=module_map)
2025-05-07T20:33:03.0391762Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.0392127Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.0392399Z E   ^
2025-05-07T20:33:03.0392858Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.0393804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError in _fbgemm_silu_mul_quant (the compiled path adds a torch/_dynamo/eval_frame.py:678 frame before entering silu_mul_quant).
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
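Every failing example so far dies in the same place: Triton rejects the fp8e4nv element type (torch.float8_e4m3fn) while lowering the kernel, which is expected on pre-Ada NVIDIA parts, since fp8e4nv requires compute capability 8.9 or newer and an A10G-class GPU reports 8.6. Below is a minimal sketch, not code from this repository, of a capability gate a test or caller could apply before touching these kernels; the helper name supports_fp8e4nv is hypothetical.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (torch.float8_e4m3fn) needs sm_89+;
        # sm_86 parts such as the A10G hit the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. at the top of test_silu_mul_quant:
    #   if not supports_fp8e4nv():
    #       raise unittest.SkipTest("fp8e4nv unsupported on this GPU (needs sm_89+)")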
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
In this example the call under test returns, and the failure surfaces in the reference path instead (test body as above, continuing):

2025-05-07T20:33:03.5033575Z         y_fp8, y_scale = fn()
2025-05-07T20:33:03.5033867Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:03.5034168Z
2025-05-07T20:33:03.5034411Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:03.5034750Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:03.5035124Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:03.5035448Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:03.5035816Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:03.5036129Z
2025-05-07T20:33:03.5036337Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:03.5036655Z moe/activation_test.py:126:
2025-05-07T20:33:03.5036954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5037298Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:03.5037635Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:03.5038425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:03.5039176Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:03.5046187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5046939Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5047640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:03.5048369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:03.5049125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:03.5049872Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:03.5050603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:03.5051240Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:03.5051958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:03.5052499Z     fn()
2025-05-07T20:33:03.5053020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:03.5053603Z     self.fn.run(
2025-05-07T20:33:03.5054081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5054608Z     kernel = self.compile(
2025-05-07T20:33:03.5055157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5055810Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5063662Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5064029Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:03.5064302Z E   ^
2025-05-07T20:33:03.5064770Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5065656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
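The reference path fails identically because _kernel_quantize_fp8_row also produces fp8e4nv output, so both the fused kernel and the reference quantizer hit the same architecture check. A standalone repro sketch (an assumption about the failing cast, not code taken from this log) that raises the same CompilationError on a pre-sm_89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # The .to(tl.float8e4nv) cast is what Triton rejects at compile
        # time on architectures without fp8e4nv support.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # CompilationError: "type fp8e4nv not supported ..."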
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError in _fbgemm_silu_mul_quant.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:04.3996682Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:04.3996954Z moe/activation_test.py:117:
2025-05-07T20:33:04.3997258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:04.3997595Z moe/activation_test.py:115: in fn
2025-05-07T20:33:04.3997882Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:04.3998447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:04.3999013Z     return fn(*args, **kwargs)
2025-05-07T20:33:04.3999669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.4000357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.4000897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.4001580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.4002292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.4002844Z kernel = self.compile( 2025-05-07T20:33:04.4003389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.4004056Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.4004458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.4004697Z 2025-05-07T20:33:04.4004910Z self = 2025-05-07T20:33:04.4005989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.4007361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f2b9940>} 2025-05-07T20:33:04.4008723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.4009799Z context = 2025-05-07T20:33:04.4010133Z 2025-05-07T20:33:04.4010305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.4010833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.4011303Z module_map=module_map) 2025-05-07T20:33:04.4011676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.4012039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.4012298Z E ^ 2025-05-07T20:33:04.4012770Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.4013271Z 2025-05-07T20:33:04.4013689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.4014203Z 2025-05-07T20:33:04.5958879Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5959319Z self=, 2025-05-07T20:33:04.5959903Z T=4096, 2025-05-07T20:33:04.5960107Z D=5120, 2025-05-07T20:33:04.5961106Z scale_ub=1200.0, 2025-05-07T20:33:04.5961584Z contiguous=False, 2025-05-07T20:33:04.5961958Z compiled=False, 2025-05-07T20:33:04.5962296Z ) 2025-05-07T20:33:04.5962812Z self = 2025-05-07T20:33:04.5963569Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.5964053Z 2025-05-07T20:33:04.5964180Z @given( 2025-05-07T20:33:04.5964589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.5965139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.5965654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.5966225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.5966782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.5967279Z ) 2025-05-07T20:33:04.5967886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.5968663Z def test_silu_mul_quant( 2025-05-07T20:33:04.5969060Z self, 2025-05-07T20:33:04.5969379Z T: int, 2025-05-07T20:33:04.5969705Z D: int, 2025-05-07T20:33:04.5970063Z scale_ub: Optional[float], 2025-05-07T20:33:04.5970520Z contiguous: bool, 2025-05-07T20:33:04.5970923Z compiled: bool, 2025-05-07T20:33:04.5971295Z ) -> None: 2025-05-07T20:33:04.5971656Z torch.manual_seed(2025) 2025-05-07T20:33:04.5972063Z 2025-05-07T20:33:04.5972898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.5973482Z 2025-05-07T20:33:04.5973802Z x_sign = torch.sign(x) 2025-05-07T20:33:04.5974287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.5974806Z x = x_sign * x_clamp 2025-05-07T20:33:04.5975222Z x0 = x[:, :D] 2025-05-07T20:33:04.5975592Z x1 = x[:, D:] 2025-05-07T20:33:04.5975930Z 2025-05-07T20:33:04.5976235Z if contiguous: 2025-05-07T20:33:04.5976626Z x0 = x0.contiguous() 2025-05-07T20:33:04.5977052Z x1 = x1.contiguous() 2025-05-07T20:33:04.5977463Z 2025-05-07T20:33:04.5977779Z if scale_ub is not None: 2025-05-07T20:33:04.5978231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.5978820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.5979363Z ) 2025-05-07T20:33:04.5979672Z else: 2025-05-07T20:33:04.5980022Z scale_ub_tensor = None 2025-05-07T20:33:04.5980460Z 2025-05-07T20:33:04.5980832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.5981506Z op = silu_mul_quant 2025-05-07T20:33:04.5981928Z if compiled: 2025-05-07T20:33:04.5982339Z op = torch.compile(op) 2025-05-07T20:33:04.5994338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5994853Z 2025-05-07T20:33:04.5995183Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.5995469Z 2025-05-07T20:33:04.5995648Z moe/activation_test.py:117: 2025-05-07T20:33:04.5996155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5996771Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.5997244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5998465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.5999707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.6000778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.6001993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.6003169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.6004117Z kernel = self.compile( 2025-05-07T20:33:04.6005059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.6006205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.6006899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6007305Z 2025-05-07T20:33:04.6007656Z self = 2025-05-07T20:33:04.6009614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.6012219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec343a0>} 2025-05-07T20:33:04.6014654Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.6016493Z context = 2025-05-07T20:33:04.6017000Z 2025-05-07T20:33:04.6017292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.6018201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.6019098Z module_map=module_map) 2025-05-07T20:33:04.6019728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.6020325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.6020773Z E ^ 2025-05-07T20:33:04.6021685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.6022502Z 2025-05-07T20:33:04.6023251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6024163Z 2025-05-07T20:33:04.6024334Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6025049Z self=, 2025-05-07T20:33:04.6025748Z T=4096, 2025-05-07T20:33:04.6026053Z D=5120, 2025-05-07T20:33:04.6026383Z scale_ub=1200.0, 2025-05-07T20:33:04.6026761Z contiguous=False, 2025-05-07T20:33:04.6027129Z compiled=True, 2025-05-07T20:33:04.6027480Z ) 2025-05-07T20:33:04.6028008Z self = 2025-05-07T20:33:04.6028759Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:04.6029125Z 2025-05-07T20:33:04.6029229Z @given( 2025-05-07T20:33:04.6029626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6030151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6030577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6031042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6031494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6031895Z ) 2025-05-07T20:33:04.6032396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6033026Z def test_silu_mul_quant( 2025-05-07T20:33:04.6033382Z self, 2025-05-07T20:33:04.6033667Z T: int, 2025-05-07T20:33:04.6033979Z D: int, 2025-05-07T20:33:04.6034424Z scale_ub: Optional[float], 2025-05-07T20:33:04.6034825Z contiguous: bool, 2025-05-07T20:33:04.6035184Z compiled: bool, 2025-05-07T20:33:04.6035524Z ) -> None: 2025-05-07T20:33:04.6035843Z torch.manual_seed(2025) 2025-05-07T20:33:04.6036198Z 2025-05-07T20:33:04.6036649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6037135Z 2025-05-07T20:33:04.6037476Z x_sign = torch.sign(x) 2025-05-07T20:33:04.6037971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.6038460Z x = x_sign * x_clamp 2025-05-07T20:33:04.6038849Z x0 = x[:, :D] 2025-05-07T20:33:04.6039194Z x1 = x[:, D:] 2025-05-07T20:33:04.6039517Z 2025-05-07T20:33:04.6039819Z if contiguous: 2025-05-07T20:33:04.6040855Z x0 = x0.contiguous() 2025-05-07T20:33:04.6041274Z x1 = x1.contiguous() 2025-05-07T20:33:04.6041673Z 2025-05-07T20:33:04.6041996Z if scale_ub is not None: 2025-05-07T20:33:04.6042403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.6042922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.6043425Z ) 2025-05-07T20:33:04.6043746Z else: 2025-05-07T20:33:04.6044083Z scale_ub_tensor = None 2025-05-07T20:33:04.6044479Z 2025-05-07T20:33:04.6044850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.6045366Z op = silu_mul_quant 2025-05-07T20:33:04.6045763Z if compiled: 2025-05-07T20:33:04.6046170Z op = torch.compile(op) 2025-05-07T20:33:04.6046651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.6047116Z 2025-05-07T20:33:04.6047455Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.6047768Z 2025-05-07T20:33:04.6047927Z moe/activation_test.py:117: 2025-05-07T20:33:04.6048404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6049132Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.6049625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.6050578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.6051502Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.6052641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.6053828Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.6054753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.6055950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.6057107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.6058076Z kernel = self.compile( 2025-05-07T20:33:04.6059041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.6060190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.6060857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6061480Z 2025-05-07T20:33:04.6061911Z self = 2025-05-07T20:33:04.6063788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.6066257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec34280>} 2025-05-07T20:33:04.6068574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.6070457Z context = 2025-05-07T20:33:04.6070974Z 2025-05-07T20:33:04.6071258Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.6072175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.6072994Z module_map=module_map) 2025-05-07T20:33:04.6073606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.6074172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.6074609Z E ^ 2025-05-07T20:33:04.6075397Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.6076204Z 2025-05-07T20:33:04.6076854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6077718Z 2025-05-07T20:33:04.8806604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8807285Z self=, 2025-05-07T20:33:04.8807790Z T=2048, 2025-05-07T20:33:04.8808034Z D=7168, 2025-05-07T20:33:04.8808245Z scale_ub=1200.0, 2025-05-07T20:33:04.8808483Z contiguous=False, 2025-05-07T20:33:04.8808732Z compiled=False, 2025-05-07T20:33:04.8808962Z ) 2025-05-07T20:33:04.8809292Z self = 2025-05-07T20:33:04.8809845Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.8810139Z 2025-05-07T20:33:04.8810224Z @given( 2025-05-07T20:33:04.8810476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8810812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8811416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8811777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8812125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8812423Z ) 2025-05-07T20:33:04.8812791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8813266Z def test_silu_mul_quant( 2025-05-07T20:33:04.8813523Z self, 2025-05-07T20:33:04.8813739Z T: int, 2025-05-07T20:33:04.8813959Z D: int, 2025-05-07T20:33:04.8814190Z scale_ub: Optional[float], 2025-05-07T20:33:04.8814482Z contiguous: bool, 2025-05-07T20:33:04.8814739Z compiled: bool, 2025-05-07T20:33:04.8814988Z ) -> None: 2025-05-07T20:33:04.8815216Z torch.manual_seed(2025) 2025-05-07T20:33:04.8815479Z 2025-05-07T20:33:04.8815770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8816124Z 2025-05-07T20:33:04.8816347Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8816656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8816984Z x = x_sign * x_clamp 2025-05-07T20:33:04.8817245Z x0 = x[:, :D] 2025-05-07T20:33:04.8817500Z x1 = x[:, D:] 2025-05-07T20:33:04.8817748Z 2025-05-07T20:33:04.8818057Z if contiguous: 2025-05-07T20:33:04.8818388Z x0 = x0.contiguous() 2025-05-07T20:33:04.8818661Z x1 = x1.contiguous() 2025-05-07T20:33:04.8818923Z 2025-05-07T20:33:04.8819135Z if scale_ub is not None: 2025-05-07T20:33:04.8819423Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8819777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8820106Z ) 2025-05-07T20:33:04.8820319Z else: 2025-05-07T20:33:04.8820540Z scale_ub_tensor = None 2025-05-07T20:33:04.8820814Z 2025-05-07T20:33:04.8821174Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8821603Z op = silu_mul_quant 2025-05-07T20:33:04.8821879Z if compiled: 2025-05-07T20:33:04.8822142Z op = torch.compile(op) 2025-05-07T20:33:04.8822448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8822739Z 2025-05-07T20:33:04.8822948Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8823124Z 2025-05-07T20:33:04.8823232Z moe/activation_test.py:117: 2025-05-07T20:33:04.8823542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8823893Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8824183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8824897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8825606Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8826161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8826855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8827540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8828091Z kernel = self.compile( 2025-05-07T20:33:04.8828654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8829317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8829734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8829974Z 2025-05-07T20:33:04.8830213Z self = 2025-05-07T20:33:04.8831355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8832841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ed85670>} 2025-05-07T20:33:04.8834194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8835239Z context = 2025-05-07T20:33:04.8835534Z 2025-05-07T20:33:04.8835708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8836246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8836728Z module_map=module_map) 2025-05-07T20:33:04.8837101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8837475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8837790Z E ^ 2025-05-07T20:33:04.8838274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8838726Z 2025-05-07T20:33:04.8839201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8839765Z 2025-05-07T20:33:04.8839874Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8840578Z self=, 2025-05-07T20:33:04.8840996Z T=1, 2025-05-07T20:33:04.8841195Z D=7168, 2025-05-07T20:33:04.8841402Z scale_ub=None, 2025-05-07T20:33:04.8841633Z contiguous=True, 2025-05-07T20:33:04.8841863Z compiled=False, 2025-05-07T20:33:04.8842081Z ) 2025-05-07T20:33:04.8842406Z self = 2025-05-07T20:33:04.8842992Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.8843262Z 2025-05-07T20:33:04.8843344Z @given( 2025-05-07T20:33:04.8843586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8843905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8844228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8844574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8844907Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8845206Z ) 2025-05-07T20:33:04.8845570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8846027Z def test_silu_mul_quant( 2025-05-07T20:33:04.8846275Z self, 2025-05-07T20:33:04.8846482Z T: int, 2025-05-07T20:33:04.8846695Z D: int, 2025-05-07T20:33:04.8846921Z scale_ub: Optional[float], 2025-05-07T20:33:04.8847204Z contiguous: bool, 2025-05-07T20:33:04.8847463Z compiled: bool, 2025-05-07T20:33:04.8847694Z ) -> None: 2025-05-07T20:33:04.8847928Z torch.manual_seed(2025) 2025-05-07T20:33:04.8848188Z 2025-05-07T20:33:04.8848463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8848821Z 2025-05-07T20:33:04.8849034Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8849338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8849661Z x = x_sign * x_clamp 2025-05-07T20:33:04.8849918Z x0 = x[:, :D] 2025-05-07T20:33:04.8850142Z x1 = x[:, D:] 2025-05-07T20:33:04.8850365Z 2025-05-07T20:33:04.8850566Z if contiguous: 2025-05-07T20:33:04.8850812Z x0 = x0.contiguous() 2025-05-07T20:33:04.8851080Z x1 = x1.contiguous() 2025-05-07T20:33:04.8851336Z 2025-05-07T20:33:04.8851543Z if scale_ub is not None: 2025-05-07T20:33:04.8851825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8852251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8852582Z ) 2025-05-07T20:33:04.8852780Z else: 2025-05-07T20:33:04.8853006Z scale_ub_tensor = None 2025-05-07T20:33:04.8853276Z 2025-05-07T20:33:04.8853513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8853850Z op = silu_mul_quant 2025-05-07T20:33:04.8854120Z if compiled: 2025-05-07T20:33:04.8854376Z op = torch.compile(op) 2025-05-07T20:33:04.8854686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8854977Z 2025-05-07T20:33:04.8855176Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8855356Z 2025-05-07T20:33:04.8855460Z moe/activation_test.py:117: 2025-05-07T20:33:04.8855768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8856114Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8856402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8857107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8857853Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8858395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8859248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8859933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8860477Z kernel = self.compile( 2025-05-07T20:33:04.8861097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8861764Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8862175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8862464Z 2025-05-07T20:33:04.8862693Z self = 2025-05-07T20:33:04.8863778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8865153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5f280>} 2025-05-07T20:33:04.8866501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8867550Z context = 2025-05-07T20:33:04.8867868Z 2025-05-07T20:33:04.8868042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8868584Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8869063Z module_map=module_map) 2025-05-07T20:33:04.8869444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8869813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8870089Z E ^ 2025-05-07T20:33:04.8870570Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8871027Z 2025-05-07T20:33:04.8871444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8871979Z 2025-05-07T20:33:04.8872087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8872518Z self=, 2025-05-07T20:33:04.8872936Z T=16384, 2025-05-07T20:33:04.8873187Z D=7168, 2025-05-07T20:33:04.8873399Z scale_ub=1200.0, 2025-05-07T20:33:04.8873636Z contiguous=False, 2025-05-07T20:33:04.8873869Z compiled=True, 2025-05-07T20:33:04.8874089Z ) 2025-05-07T20:33:05.0789293Z self = 2025-05-07T20:33:05.0790080Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.0790371Z 2025-05-07T20:33:05.0790466Z @given( 2025-05-07T20:33:05.0790708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0791042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0791369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0791722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0792066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0792370Z ) 2025-05-07T20:33:05.0792734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0793198Z def test_silu_mul_quant( 2025-05-07T20:33:05.0793456Z self, 2025-05-07T20:33:05.0793666Z T: int, 2025-05-07T20:33:05.0793870Z D: int, 2025-05-07T20:33:05.0794106Z scale_ub: Optional[float], 2025-05-07T20:33:05.0794395Z contiguous: bool, 2025-05-07T20:33:05.0794976Z compiled: bool, 2025-05-07T20:33:05.0795221Z ) -> None: 2025-05-07T20:33:05.0795454Z torch.manual_seed(2025) 2025-05-07T20:33:05.0795702Z 2025-05-07T20:33:05.0795989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0796347Z 2025-05-07T20:33:05.0796550Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0796862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0797192Z x = x_sign * x_clamp 2025-05-07T20:33:05.0797447Z x0 = x[:, :D] 2025-05-07T20:33:05.0797705Z x1 = x[:, D:] 2025-05-07T20:33:05.0797946Z 2025-05-07T20:33:05.0798149Z if contiguous: 2025-05-07T20:33:05.0798480Z x0 = x0.contiguous() 2025-05-07T20:33:05.0798762Z x1 = x1.contiguous() 2025-05-07T20:33:05.0799021Z 2025-05-07T20:33:05.0799223Z if scale_ub is not None: 2025-05-07T20:33:05.0799521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0799882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0800203Z ) 2025-05-07T20:33:05.0800418Z else: 2025-05-07T20:33:05.0800649Z scale_ub_tensor = None 2025-05-07T20:33:05.0800912Z 2025-05-07T20:33:05.0801162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0801490Z op = silu_mul_quant 2025-05-07T20:33:05.0801752Z if compiled: 2025-05-07T20:33:05.0802014Z op = torch.compile(op) 2025-05-07T20:33:05.0802325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0802608Z 2025-05-07T20:33:05.0802806Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0802993Z 2025-05-07T20:33:05.0803099Z moe/activation_test.py:117: 2025-05-07T20:33:05.0803406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0803745Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0804043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0804615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.0805176Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.0805852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0806543Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0807092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0807772Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0808529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0809079Z kernel = self.compile( 2025-05-07T20:33:05.0809627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0810296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0810700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0810935Z 2025-05-07T20:33:05.0811154Z self = 2025-05-07T20:33:05.0812227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0813609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5fee0>} 2025-05-07T20:33:05.0814962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0816066Z context = 2025-05-07T20:33:05.0816359Z 2025-05-07T20:33:05.0816537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0817062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0817532Z module_map=module_map) 2025-05-07T20:33:05.0817956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0818313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0818581Z E ^ 2025-05-07T20:33:05.0819052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0819550Z 2025-05-07T20:33:05.0819982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0820495Z 2025-05-07T20:33:05.0820608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0821112Z self=, 2025-05-07T20:33:05.0821523Z T=1, 2025-05-07T20:33:05.0821714Z D=7168, 2025-05-07T20:33:05.0821909Z scale_ub=None, 2025-05-07T20:33:05.0822139Z contiguous=False, 2025-05-07T20:33:05.0822378Z compiled=False, 2025-05-07T20:33:05.0822588Z ) 2025-05-07T20:33:05.0822913Z self = 2025-05-07T20:33:05.0823407Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.0823670Z 2025-05-07T20:33:05.0823750Z @given( 2025-05-07T20:33:05.0823993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0832625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0833004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0833363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0833715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0834019Z ) 2025-05-07T20:33:05.0834382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0834838Z def test_silu_mul_quant( 2025-05-07T20:33:05.0835098Z self, 2025-05-07T20:33:05.0835309Z T: int, 2025-05-07T20:33:05.0835514Z D: int, 2025-05-07T20:33:05.0835747Z scale_ub: Optional[float], 2025-05-07T20:33:05.0836034Z contiguous: bool, 2025-05-07T20:33:05.0836284Z compiled: bool, 2025-05-07T20:33:05.0836527Z ) -> None: 2025-05-07T20:33:05.0836755Z torch.manual_seed(2025) 2025-05-07T20:33:05.0837007Z 2025-05-07T20:33:05.0837379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0837790Z 2025-05-07T20:33:05.0837995Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0838291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0838618Z x = x_sign * x_clamp 2025-05-07T20:33:05.0838888Z x0 = x[:, :D] 2025-05-07T20:33:05.0839109Z x1 = x[:, D:] 2025-05-07T20:33:05.0839327Z 2025-05-07T20:33:05.0839526Z if contiguous: 2025-05-07T20:33:05.0839765Z x0 = x0.contiguous() 2025-05-07T20:33:05.0840038Z x1 = x1.contiguous() 2025-05-07T20:33:05.0840581Z 2025-05-07T20:33:05.0840780Z if scale_ub is not None: 2025-05-07T20:33:05.0841072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0841424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0841743Z ) 2025-05-07T20:33:05.0841958Z else: 2025-05-07T20:33:05.0842189Z scale_ub_tensor = None 2025-05-07T20:33:05.0842447Z 2025-05-07T20:33:05.0842694Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0843026Z op = silu_mul_quant 2025-05-07T20:33:05.0843296Z if compiled: 2025-05-07T20:33:05.0843550Z op = torch.compile(op) 2025-05-07T20:33:05.0844039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0844334Z 2025-05-07T20:33:05.0844534Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0844714Z 2025-05-07T20:33:05.0844820Z moe/activation_test.py:117: 2025-05-07T20:33:05.0845128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0845468Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0845764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0846478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0847248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0847796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0848493Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0849173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0849716Z kernel = self.compile( 2025-05-07T20:33:05.0850270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0850937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0851344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0851578Z 2025-05-07T20:33:05.0851788Z self = 2025-05-07T20:33:05.0852876Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0854271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ecb6670>} 2025-05-07T20:33:05.0855629Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0856664Z context = 2025-05-07T20:33:05.0856955Z 2025-05-07T20:33:05.0857127Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0857668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0858209Z module_map=module_map) 2025-05-07T20:33:05.0858582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0858946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0859215Z E ^ 2025-05-07T20:33:05.0859689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0860148Z 2025-05-07T20:33:05.0860574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0861182Z 2025-05-07T20:33:05.0861290Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0861715Z self=, 2025-05-07T20:33:05.0862119Z T=2048, 2025-05-07T20:33:05.0862315Z D=7168, 2025-05-07T20:33:05.0862516Z scale_ub=None, 2025-05-07T20:33:05.0862733Z contiguous=False, 2025-05-07T20:33:05.0862980Z compiled=True, 2025-05-07T20:33:05.0863194Z ) 2025-05-07T20:33:05.3808694Z self = 2025-05-07T20:33:05.3809230Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3809552Z 2025-05-07T20:33:05.3809642Z @given( 2025-05-07T20:33:05.3810172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3810653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3811099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3811577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3812038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3812431Z ) 2025-05-07T20:33:05.3812905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3813377Z def test_silu_mul_quant( 2025-05-07T20:33:05.3813624Z self, 2025-05-07T20:33:05.3813830Z T: int, 2025-05-07T20:33:05.3814174Z D: int, 2025-05-07T20:33:05.3814398Z scale_ub: Optional[float], 2025-05-07T20:33:05.3814682Z contiguous: bool, 2025-05-07T20:33:05.3814939Z compiled: bool, 2025-05-07T20:33:05.3815172Z ) -> None: 2025-05-07T20:33:05.3815400Z torch.manual_seed(2025) 2025-05-07T20:33:05.3815661Z 2025-05-07T20:33:05.3815942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3816295Z 2025-05-07T20:33:05.3816504Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3816807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3817121Z x = x_sign * x_clamp 2025-05-07T20:33:05.3817373Z x0 = x[:, :D] 2025-05-07T20:33:05.3817604Z x1 = x[:, D:] 2025-05-07T20:33:05.3817855Z 2025-05-07T20:33:05.3818056Z if contiguous: 2025-05-07T20:33:05.3818299Z x0 = x0.contiguous() 2025-05-07T20:33:05.3818569Z x1 = x1.contiguous() 2025-05-07T20:33:05.3818817Z 2025-05-07T20:33:05.3819019Z if scale_ub is not None: 2025-05-07T20:33:05.3819307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3819651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3819969Z ) 2025-05-07T20:33:05.3820171Z else: 2025-05-07T20:33:05.3820389Z scale_ub_tensor = None 2025-05-07T20:33:05.3820654Z 2025-05-07T20:33:05.3820896Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3821302Z op = silu_mul_quant 2025-05-07T20:33:05.3821567Z if compiled: 2025-05-07T20:33:05.3821827Z op = torch.compile(op) 2025-05-07T20:33:05.3822125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3822408Z 2025-05-07T20:33:05.3822607Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3822777Z 2025-05-07T20:33:05.3822888Z moe/activation_test.py:117: 2025-05-07T20:33:05.3823264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3823614Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3823936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3824500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3825065Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3825731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3826420Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3826957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3827639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3828305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3828849Z kernel = self.compile( 2025-05-07T20:33:05.3829395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3830053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3830505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3830775Z 2025-05-07T20:33:05.3830993Z self = 2025-05-07T20:33:05.3832081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3833455Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e8c1550>} 2025-05-07T20:33:05.3834800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3835874Z context = 2025-05-07T20:33:05.3836170Z 2025-05-07T20:33:05.3836354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3836884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3837358Z module_map=module_map) 2025-05-07T20:33:05.3837735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3838092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3838362Z E ^ 2025-05-07T20:33:05.3838836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3839284Z 2025-05-07T20:33:05.3839717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3840418Z 2025-05-07T20:33:05.3840529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3840958Z self=, 2025-05-07T20:33:05.3841376Z T=4096, 2025-05-07T20:33:05.3841567Z D=7168, 2025-05-07T20:33:05.3841769Z scale_ub=None, 2025-05-07T20:33:05.3841994Z contiguous=False, 2025-05-07T20:33:05.3842226Z compiled=True, 2025-05-07T20:33:05.3842440Z ) 2025-05-07T20:33:05.3842767Z self = 2025-05-07T20:33:05.3843267Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3843550Z 2025-05-07T20:33:05.3843632Z @given( 2025-05-07T20:33:05.3843876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3844202Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3844592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3844933Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3845277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3845566Z ) 2025-05-07T20:33:05.3845924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3846381Z def test_silu_mul_quant( 2025-05-07T20:33:05.3846625Z self, 2025-05-07T20:33:05.3846826Z T: int, 2025-05-07T20:33:05.3847031Z D: int, 2025-05-07T20:33:05.3847250Z scale_ub: Optional[float], 2025-05-07T20:33:05.3847533Z contiguous: bool, 2025-05-07T20:33:05.3847783Z compiled: bool, 2025-05-07T20:33:05.3848007Z ) -> None: 2025-05-07T20:33:05.3848238Z torch.manual_seed(2025) 2025-05-07T20:33:05.3848491Z 2025-05-07T20:33:05.3848765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3849121Z 2025-05-07T20:33:05.3849330Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3849622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3849943Z x = x_sign * x_clamp 2025-05-07T20:33:05.3850194Z x0 = x[:, :D] 2025-05-07T20:33:05.3850418Z x1 = x[:, D:] 2025-05-07T20:33:05.3850703Z 2025-05-07T20:33:05.3850960Z if contiguous: 2025-05-07T20:33:05.3851208Z x0 = x0.contiguous() 2025-05-07T20:33:05.3851470Z x1 = x1.contiguous() 2025-05-07T20:33:05.3851724Z 2025-05-07T20:33:05.3851928Z if scale_ub is not None: 2025-05-07T20:33:05.3852206Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3852553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3852872Z ) 2025-05-07T20:33:05.3853071Z else: 2025-05-07T20:33:05.3853291Z scale_ub_tensor = None 2025-05-07T20:33:05.3853554Z 2025-05-07T20:33:05.3853792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3854180Z op = silu_mul_quant 2025-05-07T20:33:05.3854442Z if compiled: 2025-05-07T20:33:05.3854695Z op = torch.compile(op) 2025-05-07T20:33:05.3854999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3855288Z 2025-05-07T20:33:05.3855489Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3855667Z 2025-05-07T20:33:05.3855770Z moe/activation_test.py:117: 2025-05-07T20:33:05.3856077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3856419Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3856703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3857269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3857838Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3858501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3859199Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3859746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3860431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3861149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3861691Z kernel = self.compile( 2025-05-07T20:33:05.3862238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3862907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3863307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3863548Z 2025-05-07T20:33:05.3863759Z self = 2025-05-07T20:33:05.3864891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3866278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b160>} 2025-05-07T20:33:05.3867624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3868657Z context = 2025-05-07T20:33:05.3868956Z 2025-05-07T20:33:05.3869125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3869658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3870128Z module_map=module_map) 2025-05-07T20:33:05.3870504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3870867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3871129Z E ^ 2025-05-07T20:33:05.3871678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3872133Z 2025-05-07T20:33:05.3872556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3873067Z 2025-05-07T20:33:05.5934751Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5936164Z self=, 2025-05-07T20:33:05.5936970Z T=16384, 2025-05-07T20:33:05.5937296Z D=5120, 2025-05-07T20:33:05.5937598Z scale_ub=1200.0, 2025-05-07T20:33:05.5938329Z contiguous=False, 2025-05-07T20:33:05.5938663Z compiled=False, 2025-05-07T20:33:05.5938978Z ) 2025-05-07T20:33:05.5939510Z self = 2025-05-07T20:33:05.5941328Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.5941832Z 2025-05-07T20:33:05.5941968Z @given( 2025-05-07T20:33:05.5942347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5942873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5943382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5943945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5944497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5944979Z ) 2025-05-07T20:33:05.5945570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5946329Z def test_silu_mul_quant( 2025-05-07T20:33:05.5946738Z self, 2025-05-07T20:33:05.5947050Z T: int, 2025-05-07T20:33:05.5947375Z D: int, 2025-05-07T20:33:05.5947736Z scale_ub: Optional[float], 2025-05-07T20:33:05.5948188Z contiguous: bool, 2025-05-07T20:33:05.5948590Z compiled: bool, 2025-05-07T20:33:05.5948964Z ) -> None: 2025-05-07T20:33:05.5949316Z torch.manual_seed(2025) 2025-05-07T20:33:05.5949718Z 2025-05-07T20:33:05.5950165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5950732Z 2025-05-07T20:33:05.5951045Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5951529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5952048Z x = x_sign * x_clamp 2025-05-07T20:33:05.5952439Z x0 = x[:, :D] 2025-05-07T20:33:05.5952789Z x1 = x[:, D:] 2025-05-07T20:33:05.5953129Z 2025-05-07T20:33:05.5953426Z if contiguous: 2025-05-07T20:33:05.5953806Z x0 = x0.contiguous() 2025-05-07T20:33:05.5954398Z x1 = x1.contiguous() 2025-05-07T20:33:05.5954804Z 2025-05-07T20:33:05.5955118Z if scale_ub is not None: 2025-05-07T20:33:05.5955573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5956119Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5956656Z ) 2025-05-07T20:33:05.5956996Z else: 2025-05-07T20:33:05.5957336Z scale_ub_tensor = None 2025-05-07T20:33:05.5957757Z 2025-05-07T20:33:05.5958136Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5958661Z op = silu_mul_quant 2025-05-07T20:33:05.5959075Z if compiled: 2025-05-07T20:33:05.5959481Z op = torch.compile(op) 2025-05-07T20:33:05.5959970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5960434Z 2025-05-07T20:33:05.5960748Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5961023Z 2025-05-07T20:33:05.5961195Z moe/activation_test.py:117: 2025-05-07T20:33:05.5961691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5962222Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5962681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5963982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.5965269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5966182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5967342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5968516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5969452Z kernel = self.compile( 2025-05-07T20:33:05.5970399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5971682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5972362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5972766Z 2025-05-07T20:33:05.5973106Z self = 2025-05-07T20:33:05.5975038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5977540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b940>} 2025-05-07T20:33:05.5979929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5981822Z context = 2025-05-07T20:33:05.5982314Z 2025-05-07T20:33:05.5982596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5983507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5984315Z module_map=module_map) 2025-05-07T20:33:05.5984929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5985522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5985949Z E ^ 2025-05-07T20:33:05.5986751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:05.5988302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
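Every failure in this run is the same compile-time error: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype, which Triton only emits on NVIDIA GPUs with compute capability 8.9 or newer; on older parts only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A minimal sketch of one way to gate such a test on device capability follows; supports_fp8e4nv, the capability threshold, and the class name are illustrative assumptions, not code from activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) codegen needs compute capability >= 8.9;
        # earlier GPUs only expose fp8e4b15/fp8e5, matching the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (E4M3) support")
    class SiluMulQuantGuardedTest(unittest.TestCase):
        ...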
[log trimmed: hypothesis went on to try ten more examples, each failing at the same point with the identical CompilationError ("type fp8e4nv not supported in this architecture"):
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True]
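For reference, the property being exercised: the test builds a bf16 activation pair (x0, x1), optionally a float32 scale_ub tensor, and expects silu_mul_quant to return an FP8 tensor plus its scale. The fused kernel's exact quantization scheme is not visible in this log; the sketch below is an eager-mode approximation assuming rowwise E4M3 scaling, with silu_mul_quant_ref as an illustrative name:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, then rowwise quantization to FP8 E4M3.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional upper bound on the rowwise amax, as in the test's scale_ub.
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale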
2025-05-07T20:33:06.9401698Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.9402418Z   self=,
2025-05-07T20:33:06.9403105Z   T=128,
2025-05-07T20:33:06.9403415Z   D=7168,
2025-05-07T20:33:06.9403738Z   scale_ub=1200.0,
2025-05-07T20:33:06.9404105Z   contiguous=False,
2025-05-07T20:33:06.9404480Z   compiled=True,
2025-05-07T20:33:06.9404905Z )
[test body and traceback identical to the first example above]
2025-05-07T20:33:06.9450458Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.9451064Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.9451502Z E   ^
2025-05-07T20:33:06.9452297Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9453109Z 2025-05-07T20:33:06.9453848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9454763Z 2025-05-07T20:33:07.1190712Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1191495Z self=, 2025-05-07T20:33:07.1192162Z T=2048, 2025-05-07T20:33:07.1192463Z D=7168, 2025-05-07T20:33:07.1192783Z scale_ub=None, 2025-05-07T20:33:07.1193138Z contiguous=True, 2025-05-07T20:33:07.1193501Z compiled=True, 2025-05-07T20:33:07.1193845Z ) 2025-05-07T20:33:07.1194367Z self = 2025-05-07T20:33:07.1195552Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.1196048Z 2025-05-07T20:33:07.1196178Z @given( 2025-05-07T20:33:07.1196554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1197083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1197595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1198187Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1198774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1199252Z ) 2025-05-07T20:33:07.1199851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1200624Z def test_silu_mul_quant( 2025-05-07T20:33:07.1201025Z self, 2025-05-07T20:33:07.1201339Z T: int, 2025-05-07T20:33:07.1201663Z D: int, 2025-05-07T20:33:07.1202025Z scale_ub: Optional[float], 2025-05-07T20:33:07.1202478Z contiguous: bool, 2025-05-07T20:33:07.1202876Z compiled: bool, 2025-05-07T20:33:07.1203265Z ) -> None: 2025-05-07T20:33:07.1203616Z torch.manual_seed(2025) 2025-05-07T20:33:07.1204022Z 2025-05-07T20:33:07.1204478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1205059Z 2025-05-07T20:33:07.1205375Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1206116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1206646Z x = x_sign * x_clamp 2025-05-07T20:33:07.1207056Z x0 = x[:, :D] 2025-05-07T20:33:07.1207410Z x1 = x[:, D:] 2025-05-07T20:33:07.1207748Z 2025-05-07T20:33:07.1208054Z if contiguous: 2025-05-07T20:33:07.1208435Z x0 = x0.contiguous() 2025-05-07T20:33:07.1208860Z x1 = x1.contiguous() 2025-05-07T20:33:07.1209267Z 2025-05-07T20:33:07.1209585Z if scale_ub is not None: 2025-05-07T20:33:07.1210041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1210601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1211293Z ) 2025-05-07T20:33:07.1211613Z else: 2025-05-07T20:33:07.1211949Z scale_ub_tensor = None 2025-05-07T20:33:07.1212379Z 2025-05-07T20:33:07.1212774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1213328Z op = silu_mul_quant 2025-05-07T20:33:07.1213763Z if compiled: 2025-05-07T20:33:07.1214180Z op = torch.compile(op) 2025-05-07T20:33:07.1214669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1215141Z 2025-05-07T20:33:07.1215453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1215737Z 2025-05-07T20:33:07.1215900Z moe/activation_test.py:117: 2025-05-07T20:33:07.1216392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1216960Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1217435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1218393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.1219377Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.1220491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.1221843Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.1222761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.1223934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.1225084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.1226035Z kernel = self.compile( 2025-05-07T20:33:07.1226979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.1228218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.1228948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1229349Z 2025-05-07T20:33:07.1229694Z self = 2025-05-07T20:33:07.1231578Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.1234063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e383550>} 2025-05-07T20:33:07.1236448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.1238238Z context = 2025-05-07T20:33:07.1238750Z 2025-05-07T20:33:07.1239029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.1239932Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.1241264Z module_map=module_map) 2025-05-07T20:33:07.1241888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.1242478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.1242920Z E ^ 2025-05-07T20:33:07.1243690Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.1244490Z 2025-05-07T20:33:07.1245227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.1246141Z 2025-05-07T20:33:07.1246323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1247136Z self=, 2025-05-07T20:33:07.1247828Z T=16384, 2025-05-07T20:33:07.1248140Z D=5120, 2025-05-07T20:33:07.1248453Z scale_ub=None, 2025-05-07T20:33:07.1248793Z contiguous=False, 2025-05-07T20:33:07.1249175Z compiled=False, 2025-05-07T20:33:07.1249511Z ) 2025-05-07T20:33:07.1250033Z self = 2025-05-07T20:33:07.1250886Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.1251362Z 2025-05-07T20:33:07.1251499Z @given( 2025-05-07T20:33:07.1251861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1252385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1252902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1253447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1254016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1254502Z ) 2025-05-07T20:33:07.1255096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1255854Z def test_silu_mul_quant( 2025-05-07T20:33:07.1256255Z self, 2025-05-07T20:33:07.1256570Z T: int, 2025-05-07T20:33:07.1256889Z D: int, 2025-05-07T20:33:07.1257248Z scale_ub: Optional[float], 2025-05-07T20:33:07.1257704Z contiguous: bool, 2025-05-07T20:33:07.1258114Z compiled: bool, 2025-05-07T20:33:07.1258499Z ) -> None: 2025-05-07T20:33:07.1258851Z torch.manual_seed(2025) 2025-05-07T20:33:07.1259241Z 2025-05-07T20:33:07.1259686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1260265Z 2025-05-07T20:33:07.1260570Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1261139Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1264583Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1267968Z 2025-05-07T20:33:07.1268174Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.1268550Z 2025-05-07T20:33:07.1268719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1269430Z self=, 2025-05-07T20:33:07.1270118Z T=4096, 2025-05-07T20:33:07.1270430Z D=7168, 2025-05-07T20:33:07.1270746Z scale_ub=1200.0, 2025-05-07T20:33:07.1271122Z contiguous=True, 2025-05-07T20:33:07.1271490Z compiled=True, 2025-05-07T20:33:07.1271823Z ) 2025-05-07T20:33:07.1272353Z self = 2025-05-07T20:33:07.1273206Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.1273828Z 2025-05-07T20:33:07.1274013Z @given( 2025-05-07T20:33:07.1274392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1274914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1275434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1275988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1276528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1277013Z ) 2025-05-07T20:33:07.1277590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1278339Z def test_silu_mul_quant( 2025-05-07T20:33:07.1278745Z self, 2025-05-07T20:33:07.1279154Z T: int, 2025-05-07T20:33:07.1279471Z D: int, 2025-05-07T20:33:07.1279824Z scale_ub: Optional[float], 2025-05-07T20:33:07.1280286Z contiguous: bool, 2025-05-07T20:33:07.1280690Z compiled: bool, 2025-05-07T20:33:07.1281048Z ) -> None: 2025-05-07T20:33:07.1281402Z torch.manual_seed(2025) 2025-05-07T20:33:07.1281787Z 2025-05-07T20:33:07.1282203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1282772Z 2025-05-07T20:33:07.1283085Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1283560Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1287193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
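The OutOfMemoryError entries interleaved here look like a knock-on effect of the failures above rather than an independent bug: the allocator reports roughly 21.9-22.0 GiB of the card's 22.07 GiB already in use when requests as small as 40 MiB arrive, and the "allocated by PyTorch" figure creeps upward across examples, which suggests tensors from earlier failing draws are still referenced. Two standard mitigations, as a hedged sketch (the placement between examples, e.g. in tearDown, is illustrative): reclaim cached memory between draws, and/or run with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as the allocator message itself recommends.

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Call between test examples to return memory held by dead
        # tensors and by the caching allocator.
        gc.collect()                  # drop unreachable Python references
        torch.cuda.empty_cache()      # hand cached blocks back to the driver
        torch.cuda.synchronize()      # make sure pending frees have completed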
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1290579Z 2025-05-07T20:33:07.1290780Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.1291145Z 2025-05-07T20:33:07.1291325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1292027Z self=, 2025-05-07T20:33:07.1292711Z T=16384, 2025-05-07T20:33:07.1293031Z D=7168, 2025-05-07T20:33:07.1293339Z scale_ub=None, 2025-05-07T20:33:07.1293686Z contiguous=False, 2025-05-07T20:33:07.1294057Z compiled=False, 2025-05-07T20:33:07.1294393Z ) 2025-05-07T20:33:07.2333867Z self = 2025-05-07T20:33:07.2335117Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2335618Z 2025-05-07T20:33:07.2335744Z @given( 2025-05-07T20:33:07.2336125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2336654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2337163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2337665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2338166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2338613Z ) 2025-05-07T20:33:07.2339158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2339883Z def test_silu_mul_quant( 2025-05-07T20:33:07.2340631Z self, 2025-05-07T20:33:07.2340963Z T: int, 2025-05-07T20:33:07.2341357Z D: int, 2025-05-07T20:33:07.2341710Z scale_ub: Optional[float], 2025-05-07T20:33:07.2342165Z contiguous: bool, 2025-05-07T20:33:07.2342548Z compiled: bool, 2025-05-07T20:33:07.2342913Z ) -> None: 2025-05-07T20:33:07.2343283Z torch.manual_seed(2025) 2025-05-07T20:33:07.2343696Z 2025-05-07T20:33:07.2344136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2347941Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
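The request sizes in these reports follow directly from the test's first allocation: x has shape [T, 2*D] in bfloat16, i.e. 2 bytes per element, so T=16384 with D=7168 asks for exactly the 448.00 MiB shown above. A quick check of that arithmetic:

    # bfloat16 tensor of shape [T, 2*D]: bytes = T * (2*D) * 2
    T, D = 16384, 7168
    bytes_needed = T * (2 * D) * 2
    assert bytes_needed == 448 * 2**20   # 448.00 MiB, matching the log
    # Likewise T=16384, D=5120 -> 320 MiB and T=4096, D=7168 -> 112 MiB.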
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2351450Z 2025-05-07T20:33:07.2351656Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2352035Z 2025-05-07T20:33:07.2352205Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2353053Z self=, 2025-05-07T20:33:07.2353746Z T=2048, 2025-05-07T20:33:07.2354057Z D=7168, 2025-05-07T20:33:07.2354374Z scale_ub=1200.0, 2025-05-07T20:33:07.2354737Z contiguous=True, 2025-05-07T20:33:07.2355106Z compiled=True, 2025-05-07T20:33:07.2355463Z ) 2025-05-07T20:33:07.2356000Z self = 2025-05-07T20:33:07.2356837Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2357311Z 2025-05-07T20:33:07.2357436Z @given( 2025-05-07T20:33:07.2357815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2358328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2358846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2359406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2359953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2360455Z ) 2025-05-07T20:33:07.2361059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2361837Z def test_silu_mul_quant( 2025-05-07T20:33:07.2362237Z self, 2025-05-07T20:33:07.2362562Z T: int, 2025-05-07T20:33:07.2362894Z D: int, 2025-05-07T20:33:07.2363253Z scale_ub: Optional[float], 2025-05-07T20:33:07.2363712Z contiguous: bool, 2025-05-07T20:33:07.2364119Z compiled: bool, 2025-05-07T20:33:07.2364484Z ) -> None: 2025-05-07T20:33:07.2364832Z torch.manual_seed(2025) 2025-05-07T20:33:07.2365245Z 2025-05-07T20:33:07.2365687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2366270Z 2025-05-07T20:33:07.2366585Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2367051Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2370577Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2373838Z 2025-05-07T20:33:07.2374040Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2374409Z 2025-05-07T20:33:07.2374577Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2375274Z self=, 2025-05-07T20:33:07.2375946Z T=2048, 2025-05-07T20:33:07.2376259Z D=7168, 2025-05-07T20:33:07.2376574Z scale_ub=None, 2025-05-07T20:33:07.2376918Z contiguous=True, 2025-05-07T20:33:07.2377302Z compiled=False, 2025-05-07T20:33:07.2377651Z ) 2025-05-07T20:33:07.2378168Z self = 2025-05-07T20:33:07.2378991Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.2379467Z 2025-05-07T20:33:07.2379681Z @given( 2025-05-07T20:33:07.2380146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2380666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2381281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2381837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2382382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2382870Z ) 2025-05-07T20:33:07.2383462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2384214Z def test_silu_mul_quant( 2025-05-07T20:33:07.2384618Z self, 2025-05-07T20:33:07.2385022Z T: int, 2025-05-07T20:33:07.2385340Z D: int, 2025-05-07T20:33:07.2385692Z scale_ub: Optional[float], 2025-05-07T20:33:07.2386147Z contiguous: bool, 2025-05-07T20:33:07.2386542Z compiled: bool, 2025-05-07T20:33:07.2386912Z ) -> None: 2025-05-07T20:33:07.2387261Z torch.manual_seed(2025) 2025-05-07T20:33:07.2387672Z 2025-05-07T20:33:07.2388113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2388700Z 2025-05-07T20:33:07.2389017Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.2392471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2395848Z 2025-05-07T20:33:07.2396049Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.2396428Z 2025-05-07T20:33:07.2396598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2397309Z self=, 2025-05-07T20:33:07.2398002Z T=1, 2025-05-07T20:33:07.2398301Z D=7168, 2025-05-07T20:33:07.2398678Z scale_ub=1200.0, 2025-05-07T20:33:07.2399047Z contiguous=True, 2025-05-07T20:33:07.2399405Z compiled=False, 2025-05-07T20:33:07.2399747Z ) 2025-05-07T20:33:07.5747114Z self = 2025-05-07T20:33:07.5748031Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5748485Z 2025-05-07T20:33:07.5748617Z @given( 2025-05-07T20:33:07.5749379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5749903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5750407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5750898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5751405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5751873Z ) 2025-05-07T20:33:07.5752433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5753138Z def test_silu_mul_quant( 2025-05-07T20:33:07.5753504Z self, 2025-05-07T20:33:07.5753799Z T: int, 2025-05-07T20:33:07.5754099Z D: int, 2025-05-07T20:33:07.5754419Z scale_ub: Optional[float], 2025-05-07T20:33:07.5754853Z contiguous: bool, 2025-05-07T20:33:07.5755248Z compiled: bool, 2025-05-07T20:33:07.5755605Z ) -> None: 2025-05-07T20:33:07.5755965Z torch.manual_seed(2025) 2025-05-07T20:33:07.5756355Z 2025-05-07T20:33:07.5756789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5757377Z 2025-05-07T20:33:07.5757702Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5758173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5758684Z x = x_sign * x_clamp 2025-05-07T20:33:07.5759358Z x0 = x[:, :D] 2025-05-07T20:33:07.5759717Z x1 = x[:, D:] 2025-05-07T20:33:07.5760059Z 2025-05-07T20:33:07.5760361Z if contiguous: 2025-05-07T20:33:07.5760729Z x0 = x0.contiguous() 2025-05-07T20:33:07.5761166Z x1 = x1.contiguous() 2025-05-07T20:33:07.5761573Z 2025-05-07T20:33:07.5761882Z if scale_ub is not None: 2025-05-07T20:33:07.5762349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5762916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5763442Z ) 2025-05-07T20:33:07.5763749Z else: 2025-05-07T20:33:07.5764106Z scale_ub_tensor = None 2025-05-07T20:33:07.5764672Z 2025-05-07T20:33:07.5765052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5765592Z op = silu_mul_quant 2025-05-07T20:33:07.5766024Z if compiled: 2025-05-07T20:33:07.5766435Z op = torch.compile(op) 2025-05-07T20:33:07.5766950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5767431Z 2025-05-07T20:33:07.5767744Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5768032Z 2025-05-07T20:33:07.5768200Z moe/activation_test.py:117: 2025-05-07T20:33:07.5768709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5769278Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5769751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5770963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5772204Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5773115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5774313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5775497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5776443Z kernel = self.compile( 2025-05-07T20:33:07.5777380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5778544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5779220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5779618Z 2025-05-07T20:33:07.5779963Z self = 2025-05-07T20:33:07.5782033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5784396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e1a3040>} 2025-05-07T20:33:07.5786715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5788510Z context = 2025-05-07T20:33:07.5789007Z 2025-05-07T20:33:07.5789284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5790189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5791008Z module_map=module_map) 2025-05-07T20:33:07.5791617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5792208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5792639Z E ^ 2025-05-07T20:33:07.5793521Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5794362Z 2025-05-07T20:33:07.5795099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5796019Z 2025-05-07T20:33:07.5796193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5796895Z self=, 2025-05-07T20:33:07.5797588Z T=128, 2025-05-07T20:33:07.5797889Z D=5120, 2025-05-07T20:33:07.5798213Z scale_ub=None, 2025-05-07T20:33:07.5798622Z contiguous=True, 2025-05-07T20:33:07.5798982Z compiled=False, 2025-05-07T20:33:07.5799412Z ) 2025-05-07T20:33:07.5799952Z self = 2025-05-07T20:33:07.5800761Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.5801230Z 2025-05-07T20:33:07.5801358Z @given( 2025-05-07T20:33:07.5801743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5802268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5802791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5803357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5803919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5804398Z ) 2025-05-07T20:33:07.5804994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5805764Z def test_silu_mul_quant( 2025-05-07T20:33:07.5806164Z self, 2025-05-07T20:33:07.5806485Z T: int, 2025-05-07T20:33:07.5806816Z D: int, 2025-05-07T20:33:07.5807167Z scale_ub: Optional[float], 2025-05-07T20:33:07.5807621Z contiguous: bool, 2025-05-07T20:33:07.5808025Z compiled: bool, 2025-05-07T20:33:07.5808403Z ) -> None: 2025-05-07T20:33:07.5808790Z torch.manual_seed(2025) 2025-05-07T20:33:07.5809197Z 2025-05-07T20:33:07.5809650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5810228Z 2025-05-07T20:33:07.5810546Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5811018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5811542Z x = x_sign * x_clamp 2025-05-07T20:33:07.5811941Z x0 = x[:, :D] 2025-05-07T20:33:07.5812285Z x1 = x[:, D:] 2025-05-07T20:33:07.5812625Z 2025-05-07T20:33:07.5812922Z if contiguous: 2025-05-07T20:33:07.5813282Z x0 = x0.contiguous() 2025-05-07T20:33:07.5813712Z x1 = x1.contiguous() 2025-05-07T20:33:07.5814107Z 2025-05-07T20:33:07.5814499Z if scale_ub is not None: 2025-05-07T20:33:07.5814962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5815526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5816040Z ) 2025-05-07T20:33:07.5816355Z else: 2025-05-07T20:33:07.5816693Z scale_ub_tensor = None 2025-05-07T20:33:07.5817124Z 2025-05-07T20:33:07.5817492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5818021Z op = silu_mul_quant 2025-05-07T20:33:07.5818472Z if compiled: 2025-05-07T20:33:07.5818883Z op = torch.compile(op) 2025-05-07T20:33:07.5819382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5819844Z 2025-05-07T20:33:07.5820155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5820438Z 2025-05-07T20:33:07.5820597Z moe/activation_test.py:117: 2025-05-07T20:33:07.5821196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5821763Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5822231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5823430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5824803Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5825677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5826852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5828015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5828939Z kernel = self.compile( 2025-05-07T20:33:07.5829889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5831051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5831819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5832208Z 2025-05-07T20:33:07.5832550Z self = 2025-05-07T20:33:07.5834477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5836947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e1a3a60>} 2025-05-07T20:33:07.5839328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5841956Z context = 2025-05-07T20:33:07.5842411Z 2025-05-07T20:33:07.5842668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5843472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5844281Z module_map=module_map) 2025-05-07T20:33:07.5844883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5845475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5845912Z E ^ 2025-05-07T20:33:07.5846711Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5847512Z 2025-05-07T20:33:07.5848246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5849217Z 2025-05-07T20:33:07.5849387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5850234Z self=, 2025-05-07T20:33:07.5850930Z T=128, 2025-05-07T20:33:07.5851237Z D=7168, 2025-05-07T20:33:07.5851552Z scale_ub=None, 2025-05-07T20:33:07.5851902Z contiguous=True, 2025-05-07T20:33:07.5852257Z compiled=False, 2025-05-07T20:33:07.5852602Z ) 2025-05-07T20:33:07.6743490Z self = 2025-05-07T20:33:07.6748198Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6748587Z 2025-05-07T20:33:07.6748702Z @given( 2025-05-07T20:33:07.6749048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6749520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6749997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6750525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6751033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6751489Z ) 2025-05-07T20:33:07.6752032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6752745Z def test_silu_mul_quant( 2025-05-07T20:33:07.6753125Z self, 2025-05-07T20:33:07.6753439Z T: int, 2025-05-07T20:33:07.6753763Z D: int, 2025-05-07T20:33:07.6754315Z scale_ub: Optional[float], 2025-05-07T20:33:07.6754788Z contiguous: bool, 2025-05-07T20:33:07.6755197Z compiled: bool, 2025-05-07T20:33:07.6755577Z ) -> None: 2025-05-07T20:33:07.6755940Z torch.manual_seed(2025) 2025-05-07T20:33:07.6756338Z 2025-05-07T20:33:07.6756783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6757329Z 2025-05-07T20:33:07.6757636Z x_sign = torch.sign(x) 2025-05-07T20:33:07.6758122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.6758708Z x = x_sign * x_clamp 2025-05-07T20:33:07.6759270Z x0 = x[:, :D] 2025-05-07T20:33:07.6759605Z x1 = x[:, D:] 2025-05-07T20:33:07.6759943Z 2025-05-07T20:33:07.6760248Z if contiguous: 2025-05-07T20:33:07.6760614Z x0 = x0.contiguous() 2025-05-07T20:33:07.6761041Z x1 = x1.contiguous() 2025-05-07T20:33:07.6761438Z 2025-05-07T20:33:07.6761748Z if scale_ub is not None: 2025-05-07T20:33:07.6762212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.6762774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.6763301Z ) 2025-05-07T20:33:07.6763609Z else: 2025-05-07T20:33:07.6763954Z scale_ub_tensor = None 2025-05-07T20:33:07.6764372Z 2025-05-07T20:33:07.6764743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.6765278Z op = silu_mul_quant 2025-05-07T20:33:07.6765699Z if compiled: 2025-05-07T20:33:07.6766095Z op = torch.compile(op) 2025-05-07T20:33:07.6766609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6767077Z 2025-05-07T20:33:07.6767387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.6767677Z 2025-05-07T20:33:07.6767841Z moe/activation_test.py:117: 2025-05-07T20:33:07.6768342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6768912Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.6769376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6770597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.6771819Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.6772735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.6773911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.6775212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.6776166Z kernel = self.compile( 2025-05-07T20:33:07.6777103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.6778258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.6778944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6779463Z 2025-05-07T20:33:07.6779810Z self = 2025-05-07T20:33:07.6781788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.6784141Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e29f790>} 2025-05-07T20:33:07.6786467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.6788297Z context = 2025-05-07T20:33:07.6788795Z 2025-05-07T20:33:07.6789068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.6789970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.6790779Z module_map=module_map) 2025-05-07T20:33:07.6791387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.6791968Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.6792399Z E ^ 2025-05-07T20:33:07.6793208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.6794044Z 2025-05-07T20:33:07.6794754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.6795671Z 2025-05-07T20:33:07.6795845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6796552Z self=, 2025-05-07T20:33:07.6797237Z T=2048, 2025-05-07T20:33:07.6797542Z D=7168, 2025-05-07T20:33:07.6797858Z scale_ub=1200.0, 2025-05-07T20:33:07.6798222Z contiguous=True, 2025-05-07T20:33:07.6798619Z compiled=False, 2025-05-07T20:33:07.6798973Z ) 2025-05-07T20:33:07.6799509Z self = 2025-05-07T20:33:07.6800338Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6800811Z 2025-05-07T20:33:07.6800937Z @given( 2025-05-07T20:33:07.6801309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6801831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6802351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6802909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6803467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6803951Z ) 2025-05-07T20:33:07.6804544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6805308Z def test_silu_mul_quant( 2025-05-07T20:33:07.6805700Z self, 2025-05-07T20:33:07.6806021Z T: int, 2025-05-07T20:33:07.6806345Z D: int, 2025-05-07T20:33:07.6806685Z scale_ub: Optional[float], 2025-05-07T20:33:07.6807137Z contiguous: bool, 2025-05-07T20:33:07.6807529Z compiled: bool, 2025-05-07T20:33:07.6807891Z ) -> None: 2025-05-07T20:33:07.6808251Z torch.manual_seed(2025) 2025-05-07T20:33:07.6808695Z 2025-05-07T20:33:07.6809219Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6812907Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6816241Z 2025-05-07T20:33:07.6816444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6816820Z 2025-05-07T20:33:07.6816988Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6817692Z self=, 2025-05-07T20:33:07.6818375Z T=1, 2025-05-07T20:33:07.6818686Z D=5120, 2025-05-07T20:33:07.6818997Z scale_ub=1200.0, 2025-05-07T20:33:07.6819353Z contiguous=True, 2025-05-07T20:33:07.6819718Z compiled=False, 2025-05-07T20:33:07.6820058Z ) 2025-05-07T20:33:07.7297110Z self = 2025-05-07T20:33:07.7298220Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.7298660Z 2025-05-07T20:33:07.7298796Z @given( 2025-05-07T20:33:07.7299159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7299676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7300194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7300753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7301434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7301928Z ) 2025-05-07T20:33:07.7302529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7303474Z def test_silu_mul_quant( 2025-05-07T20:33:07.7303895Z self, 2025-05-07T20:33:07.7304211Z T: int, 2025-05-07T20:33:07.7304545Z D: int, 2025-05-07T20:33:07.7304911Z scale_ub: Optional[float], 2025-05-07T20:33:07.7305366Z contiguous: bool, 2025-05-07T20:33:07.7305782Z compiled: bool, 2025-05-07T20:33:07.7306167Z ) -> None: 2025-05-07T20:33:07.7306532Z torch.manual_seed(2025) 2025-05-07T20:33:07.7306942Z 2025-05-07T20:33:07.7307402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7307991Z 2025-05-07T20:33:07.7308310Z x_sign = torch.sign(x) 2025-05-07T20:33:07.7308818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.7309362Z x = x_sign * x_clamp 2025-05-07T20:33:07.7309763Z x0 = x[:, :D] 2025-05-07T20:33:07.7310128Z x1 = x[:, D:] 2025-05-07T20:33:07.7310478Z 2025-05-07T20:33:07.7310798Z if contiguous: 2025-05-07T20:33:07.7311199Z x0 = x0.contiguous() 2025-05-07T20:33:07.7311643Z x1 = x1.contiguous() 2025-05-07T20:33:07.7312049Z 2025-05-07T20:33:07.7312379Z if scale_ub is not None: 2025-05-07T20:33:07.7312852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.7313420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.7313953Z ) 2025-05-07T20:33:07.7314284Z else: 2025-05-07T20:33:07.7314645Z scale_ub_tensor = None 2025-05-07T20:33:07.7315068Z 2025-05-07T20:33:07.7315462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7316011Z op = silu_mul_quant 2025-05-07T20:33:07.7316439Z if compiled: 2025-05-07T20:33:07.7316865Z op = torch.compile(op) 2025-05-07T20:33:07.7317379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7317848Z 2025-05-07T20:33:07.7318183Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.7318621Z 2025-05-07T20:33:07.7318814Z moe/activation_test.py:117: 2025-05-07T20:33:07.7319314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7319897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.7320393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7321596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.7322933Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.7323877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7325066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7326223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7327159Z kernel = self.compile( 2025-05-07T20:33:07.7328119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7329264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7329940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7330448Z 2025-05-07T20:33:07.7330803Z self = 2025-05-07T20:33:07.7332696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7335153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e23f040>} 2025-05-07T20:33:07.7337524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7339411Z context = 2025-05-07T20:33:07.7339920Z 2025-05-07T20:33:07.7340665Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7341635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7342435Z module_map=module_map) 2025-05-07T20:33:07.7343050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7343646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.7344087Z E ^ 2025-05-07T20:33:07.7344866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.7345667Z 2025-05-07T20:33:07.7346398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.7347303Z 2025-05-07T20:33:07.7347488Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7348188Z self=, 2025-05-07T20:33:07.7348931Z T=2048, 2025-05-07T20:33:07.7349247Z D=5120, 2025-05-07T20:33:07.7349568Z scale_ub=None, 2025-05-07T20:33:07.7349917Z contiguous=True, 2025-05-07T20:33:07.7350294Z compiled=False, 2025-05-07T20:33:07.7350642Z ) 2025-05-07T20:33:07.7351165Z self = 2025-05-07T20:33:07.7351999Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.7352464Z 2025-05-07T20:33:07.7352601Z @given( 2025-05-07T20:33:07.7352974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7353504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7354169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7354740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7355308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7355798Z ) 2025-05-07T20:33:07.7356393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7357160Z def test_silu_mul_quant( 2025-05-07T20:33:07.7369534Z self, 2025-05-07T20:33:07.7369886Z T: int, 2025-05-07T20:33:07.7370390Z D: int, 2025-05-07T20:33:07.7370752Z scale_ub: Optional[float], 2025-05-07T20:33:07.7371229Z contiguous: bool, 2025-05-07T20:33:07.7371638Z compiled: bool, 2025-05-07T20:33:07.7372012Z ) -> None: 2025-05-07T20:33:07.7372388Z torch.manual_seed(2025) 2025-05-07T20:33:07.7372806Z 2025-05-07T20:33:07.7373274Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7373864Z 2025-05-07T20:33:07.7374211Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.7377786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7381182Z 2025-05-07T20:33:07.7381403Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.7381770Z 2025-05-07T20:33:07.7381939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7382653Z self=, 2025-05-07T20:33:07.7383353Z T=16384, 2025-05-07T20:33:07.7383685Z D=5120, 2025-05-07T20:33:07.7384150Z scale_ub=None, 2025-05-07T20:33:07.7384510Z contiguous=True, 2025-05-07T20:33:07.7384900Z compiled=False, 2025-05-07T20:33:07.7385240Z ) 2025-05-07T20:33:07.7385782Z self = 2025-05-07T20:33:07.7386652Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.7387132Z 2025-05-07T20:33:07.7387262Z @given( 2025-05-07T20:33:07.7387654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7388193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7388711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7389284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7389855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7390353Z ) 2025-05-07T20:33:07.7390950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7391733Z def test_silu_mul_quant( 2025-05-07T20:33:07.7392151Z self, 2025-05-07T20:33:07.7392470Z T: int, 2025-05-07T20:33:07.7392803Z D: int, 2025-05-07T20:33:07.7393174Z scale_ub: Optional[float], 2025-05-07T20:33:07.7393633Z contiguous: bool, 2025-05-07T20:33:07.7394043Z compiled: bool, 2025-05-07T20:33:07.7394429Z ) -> None: 2025-05-07T20:33:07.7394785Z torch.manual_seed(2025) 2025-05-07T20:33:07.7395196Z 2025-05-07T20:33:07.7395664Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7399311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7402605Z 2025-05-07T20:33:07.7402836Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.7403205Z 2025-05-07T20:33:07.7403389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7404106Z self=, 2025-05-07T20:33:07.7404907Z T=4096, 2025-05-07T20:33:07.7405221Z D=5120, 2025-05-07T20:33:07.7405548Z scale_ub=None, 2025-05-07T20:33:07.7405913Z contiguous=True, 2025-05-07T20:33:07.7406284Z compiled=False, 2025-05-07T20:33:07.7406638Z ) 2025-05-07T20:33:07.8423927Z self = 2025-05-07T20:33:07.8424819Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.8425287Z 2025-05-07T20:33:07.8425414Z @given( 2025-05-07T20:33:07.8425816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8426317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8426828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8427388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8428167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8428665Z ) 2025-05-07T20:33:07.8429280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8430064Z def test_silu_mul_quant( 2025-05-07T20:33:07.8430472Z self, 2025-05-07T20:33:07.8430807Z T: int, 2025-05-07T20:33:07.8431146Z D: int, 2025-05-07T20:33:07.8431508Z scale_ub: Optional[float], 2025-05-07T20:33:07.8431975Z contiguous: bool, 2025-05-07T20:33:07.8432379Z compiled: bool, 2025-05-07T20:33:07.8432752Z ) -> None: 2025-05-07T20:33:07.8433128Z torch.manual_seed(2025) 2025-05-07T20:33:07.8433683Z 2025-05-07T20:33:07.8434129Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8437771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8441353Z 2025-05-07T20:33:07.8441557Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8441941Z 2025-05-07T20:33:07.8442113Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8442823Z self=, 2025-05-07T20:33:07.8443523Z T=2048, 2025-05-07T20:33:07.8443833Z D=5120, 2025-05-07T20:33:07.8444153Z scale_ub=None, 2025-05-07T20:33:07.8444505Z contiguous=False, 2025-05-07T20:33:07.8444881Z compiled=False, 2025-05-07T20:33:07.8445230Z ) 2025-05-07T20:33:07.8445760Z self = 2025-05-07T20:33:07.8446615Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8447094Z 2025-05-07T20:33:07.8447218Z @given( 2025-05-07T20:33:07.8447599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8448119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8448649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8449213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8449758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8450245Z ) 2025-05-07T20:33:07.8451000Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8451769Z def test_silu_mul_quant( 2025-05-07T20:33:07.8452180Z self, 2025-05-07T20:33:07.8452504Z T: int, 2025-05-07T20:33:07.8452829Z D: int, 2025-05-07T20:33:07.8453197Z scale_ub: Optional[float], 2025-05-07T20:33:07.8453660Z contiguous: bool, 2025-05-07T20:33:07.8454061Z compiled: bool, 2025-05-07T20:33:07.8454432Z ) -> None: 2025-05-07T20:33:07.8454780Z torch.manual_seed(2025) 2025-05-07T20:33:07.8455324Z 2025-05-07T20:33:07.8455774Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8459434Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
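A note on the log volume itself: @settings(verbosity=Verbosity.verbose) makes Hypothesis print every generated example ("Trying example: ...") together with its outcome, and deadline=None disables the per-example time budget, which is why the same test body is re-printed for each draw. A minimal sketch of the same decorator stack with normal verbosity; _MAX_SAMPLES is whatever the test module defines and is not shown in this log, so a literal stands in for it here:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_shapes_only(T: int) -> None:
        # Placeholder body; a real test would exercise the kernel here.
        assert T >= 1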
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8462793Z 2025-05-07T20:33:07.8463003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8463478Z 2025-05-07T20:33:07.8463656Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8464357Z self=, 2025-05-07T20:33:07.8465045Z T=4096, 2025-05-07T20:33:07.8465343Z D=7168, 2025-05-07T20:33:07.8465656Z scale_ub=None, 2025-05-07T20:33:07.8466010Z contiguous=True, 2025-05-07T20:33:07.8466377Z compiled=True, 2025-05-07T20:33:07.8466701Z ) 2025-05-07T20:33:07.8467232Z self = 2025-05-07T20:33:07.8468072Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8468653Z 2025-05-07T20:33:07.8468783Z @given( 2025-05-07T20:33:07.8469164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8469682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8470184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8470744Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8471303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8471778Z ) 2025-05-07T20:33:07.8472382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8473147Z def test_silu_mul_quant( 2025-05-07T20:33:07.8473550Z self, 2025-05-07T20:33:07.8473856Z T: int, 2025-05-07T20:33:07.8474177Z D: int, 2025-05-07T20:33:07.8474533Z scale_ub: Optional[float], 2025-05-07T20:33:07.8474976Z contiguous: bool, 2025-05-07T20:33:07.8475374Z compiled: bool, 2025-05-07T20:33:07.8475750Z ) -> None: 2025-05-07T20:33:07.8476098Z torch.manual_seed(2025) 2025-05-07T20:33:07.8476491Z 2025-05-07T20:33:07.8476933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8480667Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8483988Z 2025-05-07T20:33:07.8484196Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8484562Z 2025-05-07T20:33:07.8484733Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8485518Z self=, 2025-05-07T20:33:07.8486212Z T=2048, 2025-05-07T20:33:07.8486510Z D=5120, 2025-05-07T20:33:07.8486821Z scale_ub=1200.0, 2025-05-07T20:33:07.8487193Z contiguous=False, 2025-05-07T20:33:07.8487554Z compiled=False, 2025-05-07T20:33:07.8487891Z ) 2025-05-07T20:33:07.8488442Z self = 2025-05-07T20:33:07.8489259Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8489823Z 2025-05-07T20:33:07.8489949Z @given( 2025-05-07T20:33:07.8490323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8490851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8491360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8491915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8492474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8492962Z ) 2025-05-07T20:33:07.8493555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8494317Z def test_silu_mul_quant( 2025-05-07T20:33:07.8494709Z self, 2025-05-07T20:33:07.8495032Z T: int, 2025-05-07T20:33:07.8495354Z D: int, 2025-05-07T20:33:07.8495779Z scale_ub: Optional[float], 2025-05-07T20:33:07.8496237Z contiguous: bool, 2025-05-07T20:33:07.8496625Z compiled: bool, 2025-05-07T20:33:07.8496991Z ) -> None: 2025-05-07T20:33:07.8497346Z torch.manual_seed(2025) 2025-05-07T20:33:07.8497746Z 2025-05-07T20:33:07.8498192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8501854Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8505365Z 2025-05-07T20:33:07.8505577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8505954Z 2025-05-07T20:33:07.8506121Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8506830Z self=, 2025-05-07T20:33:07.8507508Z T=4096, 2025-05-07T20:33:07.8507820Z D=7168, 2025-05-07T20:33:07.8508133Z scale_ub=1200.0, 2025-05-07T20:33:07.8508503Z contiguous=True, 2025-05-07T20:33:07.8508881Z compiled=False, 2025-05-07T20:33:07.8509221Z ) 2025-05-07T20:33:07.8509749Z self = 2025-05-07T20:33:07.8510600Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.8511072Z 2025-05-07T20:33:07.8511207Z @given( 2025-05-07T20:33:07.8511566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8512095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8512607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8513163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8513711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8514196Z ) 2025-05-07T20:33:07.8514791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8515570Z def test_silu_mul_quant( 2025-05-07T20:33:07.8515975Z self, 2025-05-07T20:33:07.8516280Z T: int, 2025-05-07T20:33:07.8516590Z D: int, 2025-05-07T20:33:07.8516924Z scale_ub: Optional[float], 2025-05-07T20:33:07.8517370Z contiguous: bool, 2025-05-07T20:33:07.8517849Z compiled: bool, 2025-05-07T20:33:07.8518222Z ) -> None: 2025-05-07T20:33:07.8518573Z torch.manual_seed(2025) 2025-05-07T20:33:07.8518972Z 2025-05-07T20:33:07.8519418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8523094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8526533Z 2025-05-07T20:33:07.8526730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8527091Z 2025-05-07T20:33:07.8527274Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8527969Z self=, 2025-05-07T20:33:07.8528670Z T=16384, 2025-05-07T20:33:07.8528986Z D=7168, 2025-05-07T20:33:07.8529289Z scale_ub=None, 2025-05-07T20:33:07.8529647Z contiguous=False, 2025-05-07T20:33:07.8530086Z compiled=True, 2025-05-07T20:33:07.8530429Z ) 2025-05-07T20:33:07.9821830Z self = 2025-05-07T20:33:07.9822750Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.9823206Z 2025-05-07T20:33:07.9823345Z @given( 2025-05-07T20:33:07.9823716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9824251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9824768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9825324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9826016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9826445Z ) 2025-05-07T20:33:07.9826988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9827687Z def test_silu_mul_quant( 2025-05-07T20:33:07.9828076Z self, 2025-05-07T20:33:07.9828383Z T: int, 2025-05-07T20:33:07.9828692Z D: int, 2025-05-07T20:33:07.9829041Z scale_ub: Optional[float], 2025-05-07T20:33:07.9829497Z contiguous: bool, 2025-05-07T20:33:07.9829889Z compiled: bool, 2025-05-07T20:33:07.9830276Z ) -> None: 2025-05-07T20:33:07.9830640Z torch.manual_seed(2025) 2025-05-07T20:33:07.9831042Z 2025-05-07T20:33:07.9831503Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9835187Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9838567Z 2025-05-07T20:33:07.9838773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9839143Z 2025-05-07T20:33:07.9839324Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9840034Z self=, 2025-05-07T20:33:07.9841046Z T=4096, 2025-05-07T20:33:07.9841367Z D=7168, 2025-05-07T20:33:07.9841677Z scale_ub=None, 2025-05-07T20:33:07.9842032Z contiguous=True, 2025-05-07T20:33:07.9842406Z compiled=False, 2025-05-07T20:33:07.9842743Z ) 2025-05-07T20:33:07.9843441Z self = 2025-05-07T20:33:07.9844310Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9844787Z 2025-05-07T20:33:07.9844923Z @given( 2025-05-07T20:33:07.9845297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9845835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9846357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9846913Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9847616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9848109Z ) 2025-05-07T20:33:07.9848707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9849483Z def test_silu_mul_quant( 2025-05-07T20:33:07.9849897Z self, 2025-05-07T20:33:07.9850218Z T: int, 2025-05-07T20:33:07.9850538Z D: int, 2025-05-07T20:33:07.9850901Z scale_ub: Optional[float], 2025-05-07T20:33:07.9851375Z contiguous: bool, 2025-05-07T20:33:07.9851771Z compiled: bool, 2025-05-07T20:33:07.9852151Z ) -> None: 2025-05-07T20:33:07.9852509Z torch.manual_seed(2025) 2025-05-07T20:33:07.9852918Z 2025-05-07T20:33:07.9853375Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9857050Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9860364Z 2025-05-07T20:33:07.9860587Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9861155Z 2025-05-07T20:33:07.9861338Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9862034Z self=, 2025-05-07T20:33:07.9862731Z T=16384, 2025-05-07T20:33:07.9863053Z D=7168, 2025-05-07T20:33:07.9863373Z scale_ub=None, 2025-05-07T20:33:07.9863730Z contiguous=True, 2025-05-07T20:33:07.9864112Z compiled=False, 2025-05-07T20:33:07.9864459Z ) 2025-05-07T20:33:07.9865000Z self = 2025-05-07T20:33:07.9865853Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9866340Z 2025-05-07T20:33:07.9866471Z @given( 2025-05-07T20:33:07.9866857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9867392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9867912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9868487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9869103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9869587Z ) 2025-05-07T20:33:07.9870174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9870938Z def test_silu_mul_quant( 2025-05-07T20:33:07.9871345Z self, 2025-05-07T20:33:07.9871665Z T: int, 2025-05-07T20:33:07.9872000Z D: int, 2025-05-07T20:33:07.9872369Z scale_ub: Optional[float], 2025-05-07T20:33:07.9872816Z contiguous: bool, 2025-05-07T20:33:07.9873226Z compiled: bool, 2025-05-07T20:33:07.9873603Z ) -> None: 2025-05-07T20:33:07.9873956Z torch.manual_seed(2025) 2025-05-07T20:33:07.9874368Z 2025-05-07T20:33:07.9874825Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9878366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9881755Z 2025-05-07T20:33:07.9881980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9882348Z 2025-05-07T20:33:07.9882522Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9883232Z self=, 2025-05-07T20:33:07.9883925Z T=16384, 2025-05-07T20:33:07.9884252Z D=7168, 2025-05-07T20:33:07.9884585Z scale_ub=1200.0, 2025-05-07T20:33:07.9884964Z contiguous=True, 2025-05-07T20:33:07.9885341Z compiled=False, 2025-05-07T20:33:07.9885705Z ) 2025-05-07T20:33:07.9886245Z self = 2025-05-07T20:33:07.9887103Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9887585Z 2025-05-07T20:33:07.9887717Z @given( 2025-05-07T20:33:07.9888192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9888741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9889253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9889822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9890398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9890882Z ) 2025-05-07T20:33:07.9891476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9892249Z def test_silu_mul_quant( 2025-05-07T20:33:07.9892663Z self, 2025-05-07T20:33:07.9892990Z T: int, 2025-05-07T20:33:07.9893423Z D: int, 2025-05-07T20:33:07.9893800Z scale_ub: Optional[float], 2025-05-07T20:33:07.9894256Z contiguous: bool, 2025-05-07T20:33:07.9894667Z compiled: bool, 2025-05-07T20:33:07.9895035Z ) -> None: 2025-05-07T20:33:07.9895383Z torch.manual_seed(2025) 2025-05-07T20:33:07.9895785Z 2025-05-07T20:33:07.9896240Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9899836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
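Note that the "free" figure reported by the allocator never recovers between examples, and later in this run it drops from 26.44 MiB to 4.44 MiB, which suggests allocations accumulating across Hypothesis examples. One possible mitigation, assuming the growth is cached allocator state rather than live references (a sketch; the helper name is ours, not part of the test file):

    import gc

    import torch

    def _release_cuda_memory() -> None:
        # Drop dead Python references, then hand cached allocator blocks
        # back to the CUDA driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

Because @given runs every example inside a single unittest method, tearDown fires only once per test, so the call would have to go at the top of the test body itself.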
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9903197Z 2025-05-07T20:33:07.9903413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9903776Z 2025-05-07T20:33:07.9903948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9904645Z self=, 2025-05-07T20:33:07.9905334Z T=128, 2025-05-07T20:33:07.9905644Z D=5120, 2025-05-07T20:33:07.9905964Z scale_ub=1200.0, 2025-05-07T20:33:07.9906338Z contiguous=False, 2025-05-07T20:33:07.9906705Z compiled=False, 2025-05-07T20:33:07.9907052Z ) 2025-05-07T20:33:08.1511063Z self = 2025-05-07T20:33:08.1511978Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.1512438Z 2025-05-07T20:33:08.1512572Z @given( 2025-05-07T20:33:08.1512944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.1513447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.1514390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.1514959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.1515469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.1515876Z ) 2025-05-07T20:33:08.1516370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.1516995Z def test_silu_mul_quant( 2025-05-07T20:33:08.1517339Z self, 2025-05-07T20:33:08.1517785Z T: int, 2025-05-07T20:33:08.1518066Z D: int, 2025-05-07T20:33:08.1518382Z scale_ub: Optional[float], 2025-05-07T20:33:08.1518765Z contiguous: bool, 2025-05-07T20:33:08.1519107Z compiled: bool, 2025-05-07T20:33:08.1519442Z ) -> None: 2025-05-07T20:33:08.1519750Z torch.manual_seed(2025) 2025-05-07T20:33:08.1520088Z 2025-05-07T20:33:08.1520480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.1520983Z 2025-05-07T20:33:08.1521264Z x_sign = torch.sign(x) 2025-05-07T20:33:08.1521689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.1522139Z x = x_sign * x_clamp 2025-05-07T20:33:08.1522486Z x0 = x[:, :D] 2025-05-07T20:33:08.1522813Z x1 = x[:, D:] 2025-05-07T20:33:08.1523140Z 2025-05-07T20:33:08.1535775Z if contiguous: 2025-05-07T20:33:08.1536182Z x0 = x0.contiguous() 2025-05-07T20:33:08.1536599Z x1 = x1.contiguous() 2025-05-07T20:33:08.1537008Z 2025-05-07T20:33:08.1537328Z if scale_ub is not None: 2025-05-07T20:33:08.1537771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.1538321Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.1538837Z ) 2025-05-07T20:33:08.1539149Z else: 2025-05-07T20:33:08.1539499Z scale_ub_tensor = None 2025-05-07T20:33:08.1539926Z 2025-05-07T20:33:08.1540697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.1541505Z op = silu_mul_quant 2025-05-07T20:33:08.1541927Z if compiled: 2025-05-07T20:33:08.1542324Z op = torch.compile(op) 2025-05-07T20:33:08.1542827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.1543294Z 2025-05-07T20:33:08.1543612Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.1543907Z 2025-05-07T20:33:08.1544071Z moe/activation_test.py:117: 2025-05-07T20:33:08.1544584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.1545165Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.1545634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.1546847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.1548076Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.1548999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.1550204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.1551357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.1552290Z kernel = self.compile( 2025-05-07T20:33:08.1553234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.1554387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.1555052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.1555449Z 2025-05-07T20:33:08.1555808Z self = 2025-05-07T20:33:08.1557838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.1560232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e041ca0>} 2025-05-07T20:33:08.1562475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.1564340Z context = 2025-05-07T20:33:08.1564808Z 2025-05-07T20:33:08.1565076Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.1565955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.1566752Z module_map=module_map) 2025-05-07T20:33:08.1567371Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.1567961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.1568397Z E ^ 2025-05-07T20:33:08.1569237Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.1570041Z 2025-05-07T20:33:08.1570879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.1571768Z 2025-05-07T20:33:08.1571950Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.1572646Z self=, 2025-05-07T20:33:08.1573346Z T=2048, 2025-05-07T20:33:08.1573646Z D=7168, 2025-05-07T20:33:08.1573962Z scale_ub=None, 2025-05-07T20:33:08.1574323Z contiguous=False, 2025-05-07T20:33:08.1574688Z compiled=False, 2025-05-07T20:33:08.1575038Z ) 2025-05-07T20:33:08.1575576Z self = 2025-05-07T20:33:08.1576486Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.1576958Z 2025-05-07T20:33:08.1577084Z @given( 2025-05-07T20:33:08.1577472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.1577997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.1578506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.1579069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.1579637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.1580119Z ) 2025-05-07T20:33:08.1580718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.1581603Z def test_silu_mul_quant( 2025-05-07T20:33:08.1582002Z self, 2025-05-07T20:33:08.1582318Z T: int, 2025-05-07T20:33:08.1582630Z D: int, 2025-05-07T20:33:08.1582980Z scale_ub: Optional[float], 2025-05-07T20:33:08.1583434Z contiguous: bool, 2025-05-07T20:33:08.1583842Z compiled: bool, 2025-05-07T20:33:08.1584207Z ) -> None: 2025-05-07T20:33:08.1584558Z torch.manual_seed(2025) 2025-05-07T20:33:08.1584965Z 2025-05-07T20:33:08.1585411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.1589080Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
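The Triton CompilationError above is a hardware limitation rather than a flaky failure: fp8e4nv corresponds to float8_e4m3fn, which requires compute capability 8.9 (Ada) or newer, while the A10G GPUs backing linux.g5.4xlarge report 8.6. A capability guard one could put in front of the fp8 paths (a sketch; the helper is hypothetical, not part of the test file):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G here reports (8, 6), hence the
        # compile failures above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    # e.g. @unittest.skipUnless(_supports_fp8e4nv(), "needs sm_89+ for fp8e4nv")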
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.1592406Z 2025-05-07T20:33:08.1592608Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.1592986Z 2025-05-07T20:33:08.1593241Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.1593953Z self=, 2025-05-07T20:33:08.1594640Z T=128, 2025-05-07T20:33:08.1594943Z D=7168, 2025-05-07T20:33:08.1595256Z scale_ub=1200.0, 2025-05-07T20:33:08.1595629Z contiguous=True, 2025-05-07T20:33:08.1595984Z compiled=True, 2025-05-07T20:33:08.1596323Z ) 2025-05-07T20:33:08.2034638Z self = 2025-05-07T20:33:08.2035760Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2036208Z 2025-05-07T20:33:08.2036351Z @given( 2025-05-07T20:33:08.2036677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2037137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2037611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2038169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2038738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2039219Z ) 2025-05-07T20:33:08.2039822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2040867Z def test_silu_mul_quant( 2025-05-07T20:33:08.2041248Z self, 2025-05-07T20:33:08.2041698Z T: int, 2025-05-07T20:33:08.2042002Z D: int, 2025-05-07T20:33:08.2042339Z scale_ub: Optional[float], 2025-05-07T20:33:08.2042788Z contiguous: bool, 2025-05-07T20:33:08.2043192Z compiled: bool, 2025-05-07T20:33:08.2043543Z ) -> None: 2025-05-07T20:33:08.2043879Z torch.manual_seed(2025) 2025-05-07T20:33:08.2044271Z 2025-05-07T20:33:08.2044729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2045318Z 2025-05-07T20:33:08.2045633Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2046087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2046693Z x = x_sign * x_clamp 2025-05-07T20:33:08.2047039Z x0 = x[:, :D] 2025-05-07T20:33:08.2047362Z x1 = x[:, D:] 2025-05-07T20:33:08.2047678Z 2025-05-07T20:33:08.2047965Z if contiguous: 2025-05-07T20:33:08.2048326Z x0 = x0.contiguous() 2025-05-07T20:33:08.2048780Z x1 = x1.contiguous() 2025-05-07T20:33:08.2049168Z 2025-05-07T20:33:08.2049452Z if scale_ub is not None: 2025-05-07T20:33:08.2049853Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2050325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2050774Z ) 2025-05-07T20:33:08.2051050Z else: 2025-05-07T20:33:08.2051347Z scale_ub_tensor = None 2025-05-07T20:33:08.2051720Z 2025-05-07T20:33:08.2052063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2052536Z op = silu_mul_quant 2025-05-07T20:33:08.2052926Z if compiled: 2025-05-07T20:33:08.2053313Z op = torch.compile(op) 2025-05-07T20:33:08.2053775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2054212Z 2025-05-07T20:33:08.2054503Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2054739Z 2025-05-07T20:33:08.2054876Z moe/activation_test.py:117: 2025-05-07T20:33:08.2055308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2055798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2056214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2057070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2057971Z return fn(*args, **kwargs) 2025-05-07T20:33:08.2058987Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2060017Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2060946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2062163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2063196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2064011Z kernel = self.compile( 2025-05-07T20:33:08.2064852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2066019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2066680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2067030Z 2025-05-07T20:33:08.2067327Z self = 2025-05-07T20:33:08.2069043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2071255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158df3a0d0>} 2025-05-07T20:33:08.2073478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2075080Z context = 2025-05-07T20:33:08.2075538Z 2025-05-07T20:33:08.2075795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2076632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2077339Z module_map=module_map) 2025-05-07T20:33:08.2077939Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2078493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2078931Z E ^ 2025-05-07T20:33:08.2079665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2080404Z 2025-05-07T20:33:08.2081046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2081866Z 2025-05-07T20:33:08.2082028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2082690Z self=, 2025-05-07T20:33:08.2083326Z T=128, 2025-05-07T20:33:08.2083612Z D=7168, 2025-05-07T20:33:08.2083910Z scale_ub=1200.0, 2025-05-07T20:33:08.2084230Z contiguous=True, 2025-05-07T20:33:08.2084562Z compiled=False, 2025-05-07T20:33:08.2084857Z ) 2025-05-07T20:33:08.2085308Z self = 2025-05-07T20:33:08.2086020Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.2086423Z 2025-05-07T20:33:08.2086528Z @given( 2025-05-07T20:33:08.2086847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2087287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2087732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2088219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2088693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2089110Z ) 2025-05-07T20:33:08.2089621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2090268Z def test_silu_mul_quant( 2025-05-07T20:33:08.2090617Z self, 2025-05-07T20:33:08.2090888Z T: int, 2025-05-07T20:33:08.2091162Z D: int, 2025-05-07T20:33:08.2091470Z scale_ub: Optional[float], 2025-05-07T20:33:08.2091967Z contiguous: bool, 2025-05-07T20:33:08.2092319Z compiled: bool, 2025-05-07T20:33:08.2092630Z ) -> None: 2025-05-07T20:33:08.2092935Z torch.manual_seed(2025) 2025-05-07T20:33:08.2093293Z 2025-05-07T20:33:08.2093679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2094202Z 2025-05-07T20:33:08.2094469Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2094892Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2098076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.2101101Z 2025-05-07T20:33:08.2101285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:08.2101617Z 2025-05-07T20:33:08.2101780Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2102484Z self=, 2025-05-07T20:33:08.2103131Z T=128, 2025-05-07T20:33:08.2103417Z D=5120, 2025-05-07T20:33:08.2103696Z scale_ub=1200.0, 2025-05-07T20:33:08.2104030Z contiguous=True, 2025-05-07T20:33:08.2104362Z compiled=True, 2025-05-07T20:33:08.2104664Z ) 2025-05-07T20:33:08.2105161Z self = 2025-05-07T20:33:08.2105923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2106323Z 2025-05-07T20:33:08.2106439Z @given( 2025-05-07T20:33:08.2106775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2107305Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2107771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2108275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2108800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2109262Z ) 2025-05-07T20:33:08.2109818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2110486Z def test_silu_mul_quant( 2025-05-07T20:33:08.2110847Z self, 2025-05-07T20:33:08.2111132Z T: int, 2025-05-07T20:33:08.2111412Z D: int, 2025-05-07T20:33:08.2111734Z scale_ub: Optional[float], 2025-05-07T20:33:08.2112146Z contiguous: bool, 2025-05-07T20:33:08.2112502Z compiled: bool, 2025-05-07T20:33:08.2112854Z ) -> None: 2025-05-07T20:33:08.2113195Z torch.manual_seed(2025) 2025-05-07T20:33:08.2113577Z 2025-05-07T20:33:08.2114003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2114509Z 2025-05-07T20:33:08.2114782Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2115228Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2118354Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.2121243Z 2025-05-07T20:33:08.2121430Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:08.2121752Z 2025-05-07T20:33:08.2121971Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2122614Z self=, 2025-05-07T20:33:08.2123241Z T=128, 2025-05-07T20:33:08.2123518Z D=7168, 2025-05-07T20:33:08.2123790Z scale_ub=None, 2025-05-07T20:33:08.2124108Z contiguous=True, 2025-05-07T20:33:08.2124440Z compiled=True, 2025-05-07T20:33:08.2124742Z ) 2025-05-07T20:33:08.4281295Z self = 2025-05-07T20:33:08.4282506Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4282943Z 2025-05-07T20:33:08.4283074Z @given( 2025-05-07T20:33:08.4283431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4283876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4284359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4284904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4285464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4285950Z ) 2025-05-07T20:33:08.4286500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4287220Z def test_silu_mul_quant( 2025-05-07T20:33:08.4287625Z self, 2025-05-07T20:33:08.4287940Z T: int, 2025-05-07T20:33:08.4288419Z D: int, 2025-05-07T20:33:08.4288825Z scale_ub: Optional[float], 2025-05-07T20:33:08.4289298Z contiguous: bool, 2025-05-07T20:33:08.4289711Z compiled: bool, 2025-05-07T20:33:08.4290081Z ) -> None: 2025-05-07T20:33:08.4290429Z torch.manual_seed(2025) 2025-05-07T20:33:08.4290828Z 2025-05-07T20:33:08.4291266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4294897Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4298315Z 2025-05-07T20:33:08.4298513Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.4298865Z 2025-05-07T20:33:08.4307572Z FAILED 2025-05-07T20:33:08.4307766Z 2025-05-07T20:33:08.4307950Z =================================== FAILURES =================================== 2025-05-07T20:33:08.4308560Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:08.4309219Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:08.4310108Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:08.4310758Z | yield 2025-05-07T20:33:08.4311240Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:08.4311813Z | self._callTestMethod(testMethod) 2025-05-07T20:33:08.4312432Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:08.4313076Z | method() 2025-05-07T20:33:08.4313794Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:08.4314791Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4315648Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:08.4316284Z | raise the_error_hypothesis_found 2025-05-07T20:33:08.4316935Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:08.4317560Z +-+---------------- 1 ---------------- 2025-05-07T20:33:08.4317871Z | Traceback (most recent call last): 2025-05-07T20:33:08.4318825Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4319923Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4322791Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4325662Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4326278Z | self=, 2025-05-07T20:33:08.4326840Z | T=2048, 2025-05-07T20:33:08.4327167Z | D=5120, # or any other generated value 2025-05-07T20:33:08.4327704Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:08.4328708Z | contiguous=True, # or any other generated value 2025-05-07T20:33:08.4329228Z | compiled=False, # or any other generated value 2025-05-07T20:33:08.4329655Z | ) 2025-05-07T20:33:08.4329905Z | 2025-05-07T20:33:08.4330638Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:08.4331479Z +---------------- 2 ---------------- 2025-05-07T20:33:08.4331890Z | Traceback (most recent call last): 2025-05-07T20:33:08.4332961Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4334046Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4336898Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4339682Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4340584Z | self=, 2025-05-07T20:33:08.4341291Z | T=128, 2025-05-07T20:33:08.4341579Z | D=7168, 2025-05-07T20:33:08.4341863Z | scale_ub=None, 2025-05-07T20:33:08.4342201Z | contiguous=True, 2025-05-07T20:33:08.4342549Z | compiled=True, 2025-05-07T20:33:08.4342855Z | ) 2025-05-07T20:33:08.4343108Z | 2025-05-07T20:33:08.4343845Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4344567Z +---------------- 3 ---------------- 2025-05-07T20:33:08.4344863Z | Traceback (most recent call last): 2025-05-07T20:33:08.4345582Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4346367Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4364841Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
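Hypothesis's reproduction hint in the report above slots in as the outermost decorator on the test. A standalone sketch using the version and blob printed for failure 1 (the function name is ours and the body is elided; the blob only decodes against the exact strategy list shown in the log):

    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # version + blob from failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant_repro(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body identical to test_silu_mul_quant in moe/activation_test.py

The decorator is meant to be temporary: it pins the generator to the falsifying example while debugging and should be removed once the underlying failure is fixed.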
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4367007Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4367627Z | self=, 2025-05-07T20:33:08.4368207Z | T=128, 2025-05-07T20:33:08.4368504Z | D=5120, 2025-05-07T20:33:08.4368796Z | scale_ub=1200.0, 2025-05-07T20:33:08.4369150Z | contiguous=True, 2025-05-07T20:33:08.4369500Z | compiled=True, 2025-05-07T20:33:08.4369826Z | ) 2025-05-07T20:33:08.4370097Z | 2025-05-07T20:33:08.4370847Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4371711Z +---------------- 4 ---------------- 2025-05-07T20:33:08.4372125Z | Traceback (most recent call last): 2025-05-07T20:33:08.4373219Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:08.4374237Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:08.4375154Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:08.4376148Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4377324Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:08.4378550Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4379411Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:08.4380470Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4381658Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:08.4382758Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4383874Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:08.4384996Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4386089Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:08.4387068Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4387985Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:08.4388771Z | fn() 2025-05-07T20:33:08.4389570Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:08.4390461Z | self.fn.run( 2025-05-07T20:33:08.4391217Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:08.4392055Z | kernel = self.compile( 2025-05-07T20:33:08.4392993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:08.4393989Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4394992Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:08.4396124Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4396860Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4397417Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4397803Z | ^ 2025-05-07T20:33:08.4398459Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4399244Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4399820Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:08.4400548Z | self=, 2025-05-07T20:33:08.4401173Z | T=1, # or any other generated value 2025-05-07T20:33:08.4401614Z | D=5120, # or any other generated value 2025-05-07T20:33:08.4402104Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:08.4402639Z | contiguous=True, # or any other generated value 2025-05-07T20:33:08.4403205Z | compiled=True, # or any other generated value 2025-05-07T20:33:08.4403655Z | ) 2025-05-07T20:33:08.4403913Z | 2025-05-07T20:33:08.4404647Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4405520Z +------------------------------------ 2025-05-07T20:33:08.4406040Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:08.4406580Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4407168Z self=, 2025-05-07T20:33:08.4407804Z T=1, 2025-05-07T20:33:08.4408075Z D=5120, 2025-05-07T20:33:08.4408339Z scale_ub=None, 2025-05-07T20:33:08.4408672Z contiguous=True, 2025-05-07T20:33:08.4409015Z compiled=True, 2025-05-07T20:33:08.4409321Z ) 2025-05-07T20:33:08.4409793Z self = 2025-05-07T20:33:08.4410478Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4410850Z 2025-05-07T20:33:08.4410975Z @given( 2025-05-07T20:33:08.4411301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4411759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4412209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4412685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4413167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4413586Z ) 2025-05-07T20:33:08.4414083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4414712Z def test_silu_mul_quant( 2025-05-07T20:33:08.4415063Z self, 2025-05-07T20:33:08.4415333Z T: int, 2025-05-07T20:33:08.4415601Z D: int, 2025-05-07T20:33:08.4415909Z scale_ub: Optional[float], 2025-05-07T20:33:08.4416281Z contiguous: bool, 2025-05-07T20:33:08.4416634Z compiled: bool, 2025-05-07T20:33:08.4416956Z ) -> None: 2025-05-07T20:33:08.4417271Z torch.manual_seed(2025) 2025-05-07T20:33:08.4417620Z 2025-05-07T20:33:08.4418010Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4418507Z 2025-05-07T20:33:08.4418784Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4419198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4419651Z x = x_sign * x_clamp 2025-05-07T20:33:08.4419984Z x0 = x[:, :D] 2025-05-07T20:33:08.4420281Z x1 = x[:, D:] 2025-05-07T20:33:08.4420629Z 2025-05-07T20:33:08.4420885Z if contiguous: 2025-05-07T20:33:08.4421310Z x0 = x0.contiguous() 
2025-05-07T20:33:08.4421672Z x1 = x1.contiguous() 2025-05-07T20:33:08.4421998Z 2025-05-07T20:33:08.4422283Z if scale_ub is not None: 2025-05-07T20:33:08.4422672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4423153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4423601Z ) 2025-05-07T20:33:08.4423945Z else: 2025-05-07T20:33:08.4424257Z scale_ub_tensor = None 2025-05-07T20:33:08.4424620Z 2025-05-07T20:33:08.4424959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4425423Z op = silu_mul_quant 2025-05-07T20:33:08.4425784Z if compiled: 2025-05-07T20:33:08.4426150Z op = torch.compile(op) 2025-05-07T20:33:08.4426578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4426976Z 2025-05-07T20:33:08.4427261Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4427676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4428086Z 2025-05-07T20:33:08.4428420Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4428874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4429337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4429778Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4430286Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4430730Z 2025-05-07T20:33:08.4431020Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:08.4431294Z 2025-05-07T20:33:08.4431443Z moe/activation_test.py:126: 2025-05-07T20:33:08.4431862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4432345Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4432817Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4433987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4435053Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4435830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4436789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4437737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4438725Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4439723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4441074Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4442112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4442994Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4443784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4444469Z fn() 2025-05-07T20:33:08.4445126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4445893Z self.fn.run( 2025-05-07T20:33:08.4446509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4447196Z kernel = self.compile( 2025-05-07T20:33:08.4447901Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4448875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4449408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4449714Z 2025-05-07T20:33:08.4449984Z self = 2025-05-07T20:33:08.4451469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4453452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15925d89d0>} 2025-05-07T20:33:08.4455283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4456677Z context = 2025-05-07T20:33:08.4457086Z 2025-05-07T20:33:08.4457328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4458074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4458852Z module_map=module_map) 2025-05-07T20:33:08.4459365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4459881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4460267Z E ^ 2025-05-07T20:33:08.4460911Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4461600Z 2025-05-07T20:33:08.4462147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4462823Z 2025-05-07T20:33:08.4462961Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4463600Z self=, 2025-05-07T20:33:08.4464124Z T=2048, 2025-05-07T20:33:08.4464375Z D=5120, 2025-05-07T20:33:08.4464636Z scale_ub=1200.0, 2025-05-07T20:33:08.4464925Z contiguous=True, 2025-05-07T20:33:08.4465221Z compiled=False, 2025-05-07T20:33:08.4465500Z ) 2025-05-07T20:33:08.4465914Z self = 2025-05-07T20:33:08.4466573Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.4466937Z 2025-05-07T20:33:08.4467041Z @given( 2025-05-07T20:33:08.4467343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4467750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4468159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4468603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4469094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4469481Z ) 2025-05-07T20:33:08.4469944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4470521Z def test_silu_mul_quant( 2025-05-07T20:33:08.4470847Z self, 2025-05-07T20:33:08.4471109Z T: int, 2025-05-07T20:33:08.4471390Z D: int, 2025-05-07T20:33:08.4471689Z scale_ub: Optional[float], 2025-05-07T20:33:08.4472075Z contiguous: bool, 2025-05-07T20:33:08.4472420Z compiled: bool, 2025-05-07T20:33:08.4472720Z ) -> None: 2025-05-07T20:33:08.4473015Z torch.manual_seed(2025) 2025-05-07T20:33:08.4473347Z 2025-05-07T20:33:08.4473709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4474175Z 2025-05-07T20:33:08.4474436Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4474827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4475254Z x = x_sign * x_clamp 2025-05-07T20:33:08.4475661Z x0 = x[:, :D] 
2025-05-07T20:33:08Z
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117
moe/activation_test.py:115: in fn -> return op(x0, x1, scale_ub_tensor)
fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant -> _fbgemm_silu_mul_quant[grid](
triton/runtime/jit.py:623: in run -> kernel = self.compile(
triton/compiler/compiler.py:273: in compile -> module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Hypothesis (verbosity=Verbosity.verbose) keeps trying further examples; each "Trying example:" block re-prints this same test body, and each example fails with this same ValueError while Triton compiles an FP8 kernel. The test body continues past fn() with a dequantize step and a reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

        y_fp8_ref, y_scale_ref = ref_fn()

The error surfaces through one of two call chains:

  at y_fp8, y_scale = fn()              (moe/activation_test.py:117)
      -> silu_mul_quant                 (gen_ai/moe/activation.py:80)
      -> _fbgemm_silu_mul_quant[grid]   -> Triton compile -> CompilationError (fp8e4nv)

  at y_fp8_ref, y_scale_ref = ref_fn()  (moe/activation_test.py:126, via ref_fn at :124)
      -> triton_quantize_fp8_row        (triton_gemm/fp8_gemm.py:2370)
      -> autotuner do_bench             (triton/runtime/autotuner.py:166/186)
      -> _kernel_quantize_fp8_row[grid] -> Triton compile -> CompilationError (fp8e4nv)
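Note: fp8e4nv is Triton's name for the dtype PyTorch calls torch.float8_e4m3fn, and Triton can only lower it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older parts it raises exactly this ValueError, offering only fp8e4b15 and fp8e5. The A10G on a g5.4xlarge runner reports capability (8, 6), so every FP8-E4M3 kernel in this test is expected to fail here. A minimal guard sketch, assuming a unittest-style suite; supports_fp8e4nv and ActivationFp8Tests are hypothetical names, not code from this repository:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels need compute capability
        # >= (8, 9); the A10G on this runner reports (8, 6).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationFp8Tests(unittest.TestCase):
        # test_silu_mul_quant and the other FP8 tests would live here
        # unchanged; a class-level skip keeps Hypothesis from burning
        # max_examples on a GPU that can never compile the kernels.
        ...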
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4588067Z 2025-05-07T20:33:08.4588489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4589007Z 2025-05-07T20:33:08.4589114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4589545Z self=, 2025-05-07T20:33:08.4589955Z T=1, 2025-05-07T20:33:08.4590148Z D=7168, 2025-05-07T20:33:08.4590362Z scale_ub=None, 2025-05-07T20:33:08.4590590Z contiguous=True, 2025-05-07T20:33:08.4590823Z compiled=True, 2025-05-07T20:33:08.4591041Z ) 2025-05-07T20:33:08.4591380Z self = 2025-05-07T20:33:08.4591867Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4592146Z 2025-05-07T20:33:08.4592229Z @given( 2025-05-07T20:33:08.4592477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4592813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4593127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4593471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4593819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4594114Z ) 2025-05-07T20:33:08.4594484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4594937Z def test_silu_mul_quant( 2025-05-07T20:33:08.4595240Z self, 2025-05-07T20:33:08.4595461Z T: int, 2025-05-07T20:33:08.4595676Z D: int, 2025-05-07T20:33:08.4595903Z scale_ub: Optional[float], 2025-05-07T20:33:08.4596196Z contiguous: bool, 2025-05-07T20:33:08.4596448Z compiled: bool, 2025-05-07T20:33:08.4596683Z ) -> None: 2025-05-07T20:33:08.4596907Z torch.manual_seed(2025) 2025-05-07T20:33:08.4597161Z 2025-05-07T20:33:08.4597449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4597845Z 2025-05-07T20:33:08.4598049Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4598353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4598675Z x = x_sign * x_clamp 2025-05-07T20:33:08.4598931Z x0 = x[:, :D] 2025-05-07T20:33:08.4599164Z x1 = x[:, D:] 2025-05-07T20:33:08.4599376Z 2025-05-07T20:33:08.4599578Z if contiguous: 2025-05-07T20:33:08.4599826Z x0 = x0.contiguous() 2025-05-07T20:33:08.4600096Z x1 = x1.contiguous() 2025-05-07T20:33:08.4600352Z 2025-05-07T20:33:08.4600556Z if scale_ub is not None: 2025-05-07T20:33:08.4600833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4601182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4601547Z ) 2025-05-07T20:33:08.4601756Z else: 2025-05-07T20:33:08.4601975Z scale_ub_tensor = None 2025-05-07T20:33:08.4602246Z 2025-05-07T20:33:08.4602492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4602820Z op = silu_mul_quant 2025-05-07T20:33:08.4603091Z if compiled: 2025-05-07T20:33:08.4603360Z op = torch.compile(op) 2025-05-07T20:33:08.4603667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4603963Z 2025-05-07T20:33:08.4604173Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4604471Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4604829Z 2025-05-07T20:33:08.4605080Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4605421Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4605733Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4606071Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4606450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4606767Z 2025-05-07T20:33:08.4606989Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4607189Z 2025-05-07T20:33:08.4607304Z moe/activation_test.py:126: 2025-05-07T20:33:08.4607604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4607953Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4608306Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4609108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4609886Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4610451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4611146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4611837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4612571Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4613334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4614093Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4614872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4615529Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4616142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4616660Z fn() 2025-05-07T20:33:08.4617183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4617797Z self.fn.run( 2025-05-07T20:33:08.4618272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4618805Z kernel = self.compile( 2025-05-07T20:33:08.4619343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4620001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4620411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4620644Z 2025-05-07T20:33:08.4620859Z self = 2025-05-07T20:33:08.4622086Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4623488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590ffef70>} 2025-05-07T20:33:08.4624840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4625865Z context = 2025-05-07T20:33:08.4626160Z 2025-05-07T20:33:08.4626387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4626909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4627383Z module_map=module_map) 2025-05-07T20:33:08.4627775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4628137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4628416Z E ^ 2025-05-07T20:33:08.4628953Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4629404Z 2025-05-07T20:33:08.4629829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4630337Z 2025-05-07T20:33:08.4630446Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4630870Z self=, 2025-05-07T20:33:08.4631288Z T=4096, 2025-05-07T20:33:08.4631486Z D=5120, 2025-05-07T20:33:08.4631689Z scale_ub=None, 2025-05-07T20:33:08.4631917Z contiguous=False, 2025-05-07T20:33:08.4632149Z compiled=False, 2025-05-07T20:33:08.4632367Z ) 2025-05-07T20:33:08.4632695Z self = 2025-05-07T20:33:08.4633199Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4633487Z 2025-05-07T20:33:08.4633569Z @given( 2025-05-07T20:33:08.4633809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4634141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4634455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4634799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4635138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4635435Z ) 2025-05-07T20:33:08.4635877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4636343Z def test_silu_mul_quant( 2025-05-07T20:33:08.4636586Z self, 2025-05-07T20:33:08.4636792Z T: int, 2025-05-07T20:33:08.4637001Z D: int, 2025-05-07T20:33:08.4637224Z scale_ub: Optional[float], 2025-05-07T20:33:08.4637511Z contiguous: bool, 2025-05-07T20:33:08.4637759Z compiled: bool, 2025-05-07T20:33:08.4637993Z ) -> None: 2025-05-07T20:33:08.4638216Z torch.manual_seed(2025) 2025-05-07T20:33:08.4638522Z 2025-05-07T20:33:08.4638807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4639150Z 2025-05-07T20:33:08.4639355Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4639657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4639973Z x = x_sign * x_clamp 2025-05-07T20:33:08.4640541Z x0 = x[:, :D] 2025-05-07T20:33:08.4640773Z x1 = x[:, D:] 2025-05-07T20:33:08.4640988Z 2025-05-07T20:33:08.4641189Z if contiguous: 2025-05-07T20:33:08.4641433Z x0 = x0.contiguous() 2025-05-07T20:33:08.4641697Z x1 = x1.contiguous() 2025-05-07T20:33:08.4641948Z 2025-05-07T20:33:08.4642153Z if scale_ub is not None: 2025-05-07T20:33:08.4642430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4642911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4643237Z ) 2025-05-07T20:33:08.4643445Z else: 2025-05-07T20:33:08.4643660Z scale_ub_tensor = None 2025-05-07T20:33:08.4643922Z 2025-05-07T20:33:08.4644170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4644487Z op = silu_mul_quant 2025-05-07T20:33:08.4644748Z if compiled: 
2025-05-07T20:33:08.4645007Z op = torch.compile(op) 2025-05-07T20:33:08.4645306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4645590Z 2025-05-07T20:33:08.4645876Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4646061Z 2025-05-07T20:33:08.4646166Z moe/activation_test.py:117: 2025-05-07T20:33:08.4646472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4646814Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4647106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4647798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4648493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4649037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4649727Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4650395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4650939Z kernel = self.compile( 2025-05-07T20:33:08.4651479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4652137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4652543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4652775Z 2025-05-07T20:33:08.4652992Z self = 2025-05-07T20:33:08.4654069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4655435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c1fca0>} 2025-05-07T20:33:08.4656839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4657882Z context = 2025-05-07T20:33:08.4658170Z 2025-05-07T20:33:08.4658346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4658884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4659418Z module_map=module_map) 2025-05-07T20:33:08.4659797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4660156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4660427Z E ^ 2025-05-07T20:33:08.4660895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4661450Z 2025-05-07T20:33:08.4661886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4662393Z 2025-05-07T20:33:08.4662501Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4662919Z self=, 2025-05-07T20:33:08.4663376Z T=4096, 2025-05-07T20:33:08.4663573Z D=7168, 2025-05-07T20:33:08.4663774Z scale_ub=None, 2025-05-07T20:33:08.4664002Z contiguous=False, 2025-05-07T20:33:08.4664234Z compiled=False, 2025-05-07T20:33:08.4664448Z ) 2025-05-07T20:33:08.4664773Z self = 2025-05-07T20:33:08.4665271Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4665557Z 2025-05-07T20:33:08.4665640Z @given( 2025-05-07T20:33:08.4665879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4666205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4666570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4666909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4667245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4667533Z ) 2025-05-07T20:33:08.4667897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4668344Z def test_silu_mul_quant( 2025-05-07T20:33:08.4668589Z self, 2025-05-07T20:33:08.4668799Z T: int, 2025-05-07T20:33:08.4669010Z D: int, 2025-05-07T20:33:08.4669234Z scale_ub: Optional[float], 2025-05-07T20:33:08.4669516Z contiguous: bool, 2025-05-07T20:33:08.4669769Z compiled: bool, 2025-05-07T20:33:08.4670002Z ) -> None: 2025-05-07T20:33:08.4670223Z torch.manual_seed(2025) 2025-05-07T20:33:08.4670479Z 2025-05-07T20:33:08.4670759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4671107Z 2025-05-07T20:33:08.4671315Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4671621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4671937Z x = x_sign * x_clamp 2025-05-07T20:33:08.4672189Z x0 = x[:, :D] 2025-05-07T20:33:08.4672420Z x1 = x[:, D:] 2025-05-07T20:33:08.4672631Z 2025-05-07T20:33:08.4672831Z if contiguous: 2025-05-07T20:33:08.4673075Z x0 = x0.contiguous() 2025-05-07T20:33:08.4673337Z x1 = x1.contiguous() 2025-05-07T20:33:08.4673594Z 2025-05-07T20:33:08.4673795Z if scale_ub is not None: 2025-05-07T20:33:08.4674073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4674418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4674735Z ) 2025-05-07T20:33:08.4674931Z else: 2025-05-07T20:33:08.4675151Z scale_ub_tensor = None 2025-05-07T20:33:08.4675417Z 2025-05-07T20:33:08.4675708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4676035Z op = silu_mul_quant 2025-05-07T20:33:08.4676297Z if compiled: 2025-05-07T20:33:08.4676557Z op = torch.compile(op) 2025-05-07T20:33:08.4676856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4677138Z 2025-05-07T20:33:08.4677343Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4677513Z 2025-05-07T20:33:08.4677616Z moe/activation_test.py:117: 2025-05-07T20:33:08.4677972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4678315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4678616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4679341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4680034Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4680581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4681266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4681929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4682464Z kernel = self.compile( 2025-05-07T20:33:08.4683053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4683723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4684127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4684360Z 2025-05-07T20:33:08.4684580Z self = 2025-05-07T20:33:08.4685659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4687089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c7a700>} 2025-05-07T20:33:08.4688432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4689454Z context = 2025-05-07T20:33:08.4689743Z 2025-05-07T20:33:08.4689922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4690446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4690922Z module_map=module_map) 2025-05-07T20:33:08.4691304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4691665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4691939Z E ^ 2025-05-07T20:33:08.4692412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4692858Z 2025-05-07T20:33:08.4693281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4693791Z 2025-05-07T20:33:08.4693899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4694321Z self=, 2025-05-07T20:33:08.4694734Z T=128, 2025-05-07T20:33:08.4694941Z D=7168, 2025-05-07T20:33:08.4695137Z scale_ub=None, 2025-05-07T20:33:08.4695362Z contiguous=False, 2025-05-07T20:33:08.4695603Z compiled=True, 2025-05-07T20:33:08.4695809Z ) 2025-05-07T20:33:08.4696139Z self = 2025-05-07T20:33:08.4696685Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4696958Z 2025-05-07T20:33:08.4697038Z @given( 2025-05-07T20:33:08.4697280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4697603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4697917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4698256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4698678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4698993Z ) 2025-05-07T20:33:08.4699346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4699796Z def test_silu_mul_quant( 2025-05-07T20:33:08.4700048Z self, 2025-05-07T20:33:08.4700246Z T: int, 2025-05-07T20:33:08.4700456Z D: int, 2025-05-07T20:33:08.4700685Z scale_ub: Optional[float], 2025-05-07T20:33:08.4700962Z contiguous: bool, 2025-05-07T20:33:08.4701355Z compiled: bool, 2025-05-07T20:33:08.4701591Z ) -> None: 2025-05-07T20:33:08.4701813Z torch.manual_seed(2025) 2025-05-07T20:33:08.4702065Z 2025-05-07T20:33:08.4702350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4702694Z 2025-05-07T20:33:08.4702991Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4703300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4703617Z x = x_sign * x_clamp 2025-05-07T20:33:08.4703873Z x0 = x[:, :D] 2025-05-07T20:33:08.4704102Z x1 = x[:, D:] 2025-05-07T20:33:08.4704321Z 2025-05-07T20:33:08.4704511Z if contiguous: 2025-05-07T20:33:08.4704753Z x0 = x0.contiguous() 2025-05-07T20:33:08.4705024Z x1 = x1.contiguous() 2025-05-07T20:33:08.4705272Z 2025-05-07T20:33:08.4705477Z if scale_ub is not None: 2025-05-07T20:33:08.4705763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4706182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4706504Z ) 2025-05-07T20:33:08.4706708Z else: 2025-05-07T20:33:08.4706924Z scale_ub_tensor = None 2025-05-07T20:33:08.4707191Z 2025-05-07T20:33:08.4707439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4707760Z op = silu_mul_quant 2025-05-07T20:33:08.4708024Z if compiled: 2025-05-07T20:33:08.4708289Z op = torch.compile(op) 2025-05-07T20:33:08.4708594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4708877Z 2025-05-07T20:33:08.4709084Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4709375Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4709677Z 2025-05-07T20:33:08.4709925Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4710273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4710577Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4710902Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4711269Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4711581Z 2025-05-07T20:33:08.4711795Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4711996Z 2025-05-07T20:33:08.4712111Z moe/activation_test.py:126: 2025-05-07T20:33:08.4712410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4712759Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4713102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4713890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4714638Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4715237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4715937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4716627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4717357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4718117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4718916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4719647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4720290Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4720899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4721422Z fn() 2025-05-07T20:33:08.4721933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4722513Z self.fn.run( 2025-05-07T20:33:08.4723025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4723562Z kernel = self.compile( 2025-05-07T20:33:08.4724115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4737303Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4737731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4737968Z 2025-05-07T20:33:08.4738182Z self = 2025-05-07T20:33:08.4739291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4741196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590a2f5e0>} 2025-05-07T20:33:08.4742560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4743581Z context = 2025-05-07T20:33:08.4743885Z 2025-05-07T20:33:08.4744058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4744600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4745092Z module_map=module_map) 2025-05-07T20:33:08.4745468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4745839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4746126Z E ^ 2025-05-07T20:33:08.4746597Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4747060Z 2025-05-07T20:33:08.4747485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4748016Z 2025-05-07T20:33:08.4748124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4748553Z self=, 2025-05-07T20:33:08.4748965Z T=128, 2025-05-07T20:33:08.4749171Z D=7168, 2025-05-07T20:33:08.4749376Z scale_ub=None, 2025-05-07T20:33:08.4749596Z contiguous=False, 2025-05-07T20:33:08.4749839Z compiled=False, 2025-05-07T20:33:08.4750064Z ) 2025-05-07T20:33:08.4750549Z self = 2025-05-07T20:33:08.4751059Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4751338Z 2025-05-07T20:33:08.4751421Z @given( 2025-05-07T20:33:08.4751666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4751984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4752305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4752712Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4753050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4753350Z ) 2025-05-07T20:33:08.4753714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4754158Z def test_silu_mul_quant( 2025-05-07T20:33:08.4754413Z self, 2025-05-07T20:33:08.4754619Z T: int, 2025-05-07T20:33:08.4754818Z D: int, 2025-05-07T20:33:08.4755055Z scale_ub: Optional[float], 2025-05-07T20:33:08.4755340Z contiguous: bool, 2025-05-07T20:33:08.4755591Z compiled: bool, 2025-05-07T20:33:08.4755820Z ) -> None: 2025-05-07T20:33:08.4756044Z torch.manual_seed(2025) 2025-05-07T20:33:08.4756296Z 2025-05-07T20:33:08.4756642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4757002Z 2025-05-07T20:33:08.4757203Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4757503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4757829Z x = x_sign * x_clamp 2025-05-07T20:33:08.4758086Z x0 = x[:, :D] 2025-05-07T20:33:08.4758309Z x1 = x[:, D:] 2025-05-07T20:33:08.4758539Z 2025-05-07T20:33:08.4758777Z if contiguous: 2025-05-07T20:33:08.4759021Z x0 = x0.contiguous() 2025-05-07T20:33:08.4759296Z x1 = x1.contiguous() 2025-05-07T20:33:08.4759548Z 2025-05-07T20:33:08.4759816Z if scale_ub is not None: 2025-05-07T20:33:08.4760107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4760454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4760779Z ) 2025-05-07T20:33:08.4760982Z else: 2025-05-07T20:33:08.4761211Z scale_ub_tensor = None 2025-05-07T20:33:08.4761482Z 2025-05-07T20:33:08.4761719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4762048Z op = silu_mul_quant 2025-05-07T20:33:08.4762314Z if compiled: 
2025-05-07T20:33:08.4762572Z op = torch.compile(op) 2025-05-07T20:33:08.4762881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4763168Z 2025-05-07T20:33:08.4763367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4763546Z 2025-05-07T20:33:08.4763652Z moe/activation_test.py:117: 2025-05-07T20:33:08.4763963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4764304Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4764604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4765316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4766013Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4766558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4767260Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4767926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4768464Z kernel = self.compile( 2025-05-07T20:33:08.4769011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4769722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4770141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4770375Z 2025-05-07T20:33:08.4770586Z self = 2025-05-07T20:33:08.4771669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4773104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15906d2ee0>} 2025-05-07T20:33:08.4774452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4775475Z context = 2025-05-07T20:33:08.4775767Z 2025-05-07T20:33:08.4775941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4776471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4776988Z module_map=module_map) 2025-05-07T20:33:08.4777363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4777732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4778001Z E ^ 2025-05-07T20:33:08.4778473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4778929Z 2025-05-07T20:33:08.4779350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4779862Z 2025-05-07T20:33:08.4779969Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4780437Z self=, 2025-05-07T20:33:08.4780848Z T=4096, 2025-05-07T20:33:08.4781107Z D=5120, 2025-05-07T20:33:08.4781317Z scale_ub=1200.0, 2025-05-07T20:33:08.4781551Z contiguous=True, 2025-05-07T20:33:08.4781779Z compiled=False, 2025-05-07T20:33:08.4782001Z ) 2025-05-07T20:33:08.4782332Z self = 2025-05-07T20:33:08.4782834Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.4783124Z 2025-05-07T20:33:08.4783205Z @given( 2025-05-07T20:33:08.4783449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4783769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4784087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4784430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4784773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4785079Z ) 2025-05-07T20:33:08.4785436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4785874Z def test_silu_mul_quant( 2025-05-07T20:33:08.4786129Z self, 2025-05-07T20:33:08.4786331Z T: int, 2025-05-07T20:33:08.4786533Z D: int, 2025-05-07T20:33:08.4786764Z scale_ub: Optional[float], 2025-05-07T20:33:08.4787044Z contiguous: bool, 2025-05-07T20:33:08.4787286Z compiled: bool, 2025-05-07T20:33:08.4787522Z ) -> None: 2025-05-07T20:33:08.4787744Z torch.manual_seed(2025) 2025-05-07T20:33:08.4787990Z 2025-05-07T20:33:08.4788272Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4788619Z 2025-05-07T20:33:08.4788818Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4789116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4789438Z x = x_sign * x_clamp 2025-05-07T20:33:08.4789745Z x0 = x[:, :D] 2025-05-07T20:33:08.4789967Z x1 = x[:, D:] 2025-05-07T20:33:08.4790182Z 2025-05-07T20:33:08.4790375Z if contiguous: 2025-05-07T20:33:08.4790608Z x0 = x0.contiguous() 2025-05-07T20:33:08.4790873Z x1 = x1.contiguous() 2025-05-07T20:33:08.4791120Z 2025-05-07T20:33:08.4791315Z if scale_ub is not None: 2025-05-07T20:33:08.4791594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4791936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4792291Z ) 2025-05-07T20:33:08.4792496Z else: 2025-05-07T20:33:08.4792712Z scale_ub_tensor = None 2025-05-07T20:33:08.4792965Z 2025-05-07T20:33:08.4793204Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4793527Z op = silu_mul_quant 2025-05-07T20:33:08.4793780Z if compiled: 2025-05-07T20:33:08.4794042Z op = torch.compile(op) 2025-05-07T20:33:08.4794356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4794638Z 2025-05-07T20:33:08.4794835Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4795009Z 2025-05-07T20:33:08.4795110Z moe/activation_test.py:117: 2025-05-07T20:33:08.4795412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4795817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4796111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4796813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4797502Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4798048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4798732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4799399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4799977Z kernel = self.compile( 2025-05-07T20:33:08.4800364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4800546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4800685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4800692Z 2025-05-07T20:33:08.4800900Z self = 2025-05-07T20:33:08.4801673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4802187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15907a9670>} 2025-05-07T20:33:08.4802933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4803135Z context = 2025-05-07T20:33:08.4803140Z 2025-05-07T20:33:08.4803311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4803582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4803701Z module_map=module_map) 2025-05-07T20:33:08.4803867Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4803974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4804058Z E ^ 2025-05-07T20:33:08.4804454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4804462Z 2025-05-07T20:33:08.4804888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4804893Z 2025-05-07T20:33:08.4804998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4805232Z self=, 2025-05-07T20:33:08.4805313Z T=1, 2025-05-07T20:33:08.4805430Z D=5120, 2025-05-07T20:33:08.4805523Z scale_ub=None, 2025-05-07T20:33:08.4805611Z contiguous=True, 2025-05-07T20:33:08.4805696Z compiled=True, 2025-05-07T20:33:08.4805779Z ) 2025-05-07T20:33:08.4805999Z self = 2025-05-07T20:33:08.4806165Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4806170Z 2025-05-07T20:33:08.4806257Z @given( 2025-05-07T20:33:08.4806382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4806487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4806616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4806738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4806861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4806978Z ) 2025-05-07T20:33:08.4807227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4807335Z def test_silu_mul_quant( 2025-05-07T20:33:08.4807417Z self, 2025-05-07T20:33:08.4807496Z T: int, 2025-05-07T20:33:08.4807580Z D: int, 2025-05-07T20:33:08.4807680Z scale_ub: Optional[float], 2025-05-07T20:33:08.4807771Z contiguous: bool, 2025-05-07T20:33:08.4807868Z compiled: bool, 2025-05-07T20:33:08.4807949Z ) -> None: 2025-05-07T20:33:08.4808053Z torch.manual_seed(2025) 2025-05-07T20:33:08.4808129Z 2025-05-07T20:33:08.4808306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4808427Z 2025-05-07T20:33:08.4808521Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4808653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4808750Z x = x_sign * x_clamp 2025-05-07T20:33:08.4808833Z x0 = x[:, :D] 2025-05-07T20:33:08.4808917Z x1 = x[:, D:] 2025-05-07T20:33:08.4808999Z 2025-05-07T20:33:08.4809086Z if contiguous: 2025-05-07T20:33:08.4809182Z x0 = x0.contiguous() 2025-05-07T20:33:08.4809282Z x1 = x1.contiguous() 2025-05-07T20:33:08.4809356Z 2025-05-07T20:33:08.4809452Z if scale_ub is not None: 2025-05-07T20:33:08.4809565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4809715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4809794Z ) 2025-05-07T20:33:08.4809882Z else: 2025-05-07T20:33:08.4809978Z scale_ub_tensor = None 2025-05-07T20:33:08.4810061Z 2025-05-07T20:33:08.4810202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4810296Z op = silu_mul_quant 2025-05-07T20:33:08.4810387Z if compiled: 2025-05-07T20:33:08.4810503Z op = torch.compile(op) 2025-05-07T20:33:08.4810612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4810695Z 2025-05-07T20:33:08.4810790Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4810916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4810998Z 2025-05-07T20:33:08.4811139Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4811245Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4811354Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4811478Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4811619Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4811706Z 2025-05-07T20:33:08.4811867Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4811872Z 2025-05-07T20:33:08.4811985Z moe/activation_test.py:126: 2025-05-07T20:33:08.4812116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4812225Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4812375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4812939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4813087Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4813453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4813679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4814053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4814318Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4814711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4815011Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4815392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4815570Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4815912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4815992Z fn() 2025-05-07T20:33:08.4816391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4816521Z self.fn.run( 2025-05-07T20:33:08.4816855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4816957Z kernel = self.compile( 2025-05-07T20:33:08.4817333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4817520Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4817650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4817657Z 2025-05-07T20:33:08.4817864Z self = 2025-05-07T20:33:08.4818643Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4819147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f15903f6550>}
2025-05-07T20:33:08.4819898Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4820092Z context = 
2025-05-07T20:33:08.4820276Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4820541Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4820650Z                            module_map=module_map)
2025-05-07T20:33:08.4820822Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4820926Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.4821074Z E       ^
2025-05-07T20:33:08.4821486Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4821909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.4822027Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4822251Z     self=,
2025-05-07T20:33:08.4822368Z     T=2048,
2025-05-07T20:33:08.4822458Z     D=5120,
2025-05-07T20:33:08.4822542Z     scale_ub=None,
2025-05-07T20:33:08.4822630Z     contiguous=True,
2025-05-07T20:33:08.4822723Z     compiled=True,
2025-05-07T20:33:08.4822798Z )
2025-05-07T20:33:08.4823019Z self = 
2025-05-07T20:33:08.4823198Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:08.4823289Z     @given(
2025-05-07T20:33:08.4823417Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:08.4823520Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:08.4823638Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:08.4823804Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:08.4823922Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:08.4823998Z     )
2025-05-07T20:33:08.4824258Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:08.4824355Z     def test_silu_mul_quant(
2025-05-07T20:33:08.4824435Z         self,
2025-05-07T20:33:08.4824521Z         T: int,
2025-05-07T20:33:08.4824601Z         D: int,
2025-05-07T20:33:08.4824709Z         scale_ub: Optional[float],
2025-05-07T20:33:08.4824801Z         contiguous: bool,
2025-05-07T20:33:08.4824891Z         compiled: bool,
2025-05-07T20:33:08.4824976Z     ) -> None:
2025-05-07T20:33:08.4825117Z         torch.manual_seed(2025)
2025-05-07T20:33:08.4825193Z 
2025-05-07T20:33:08.4825373Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:08.4825451Z 
2025-05-07T20:33:08.4825546Z         x_sign = torch.sign(x)
2025-05-07T20:33:08.4825681Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:08.4825776Z         x = x_sign * x_clamp
2025-05-07T20:33:08.4825860Z         x0 = x[:, :D]
2025-05-07T20:33:08.4825950Z         x1 = x[:, D:]
2025-05-07T20:33:08.4826031Z 
2025-05-07T20:33:08.4826122Z         if contiguous:
2025-05-07T20:33:08.4826226Z             x0 = x0.contiguous()
2025-05-07T20:33:08.4826315Z             x1 = x1.contiguous()
2025-05-07T20:33:08.4826401Z 
2025-05-07T20:33:08.4826495Z         if scale_ub is not None:
2025-05-07T20:33:08.4826602Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:08.4826750Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:08.4826829Z             )
2025-05-07T20:33:08.4826915Z         else:
2025-05-07T20:33:08.4827022Z             scale_ub_tensor = None
2025-05-07T20:33:08.4827099Z 
2025-05-07T20:33:08.4827234Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.4827340Z             op = silu_mul_quant
2025-05-07T20:33:08.4827430Z             if compiled:
2025-05-07T20:33:08.4827535Z                 op = torch.compile(op)
2025-05-07T20:33:08.4827653Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.4827732Z 
2025-05-07T20:33:08.4827835Z         y_fp8, y_scale = fn()
2025-05-07T20:33:08.4827959Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:08.4828034Z 
2025-05-07T20:33:08.4828180Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.4828287Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:08.4828390Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:08.4828525Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:08.4828720Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.4828797Z 
2025-05-07T20:33:08.4828910Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:08.4829014Z moe/activation_test.py:126: 
2025-05-07T20:33:08.4829157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4829265Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:08.4829402Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.4830034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:08.4830139Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:08.4830497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.4830735Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.4831112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:08.4831376Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.4831807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:33:08.4832067Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.4832454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:08.4832625Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:08.4832973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:08.4833054Z     fn()
2025-05-07T20:33:08.4833453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:08.4833586Z     self.fn.run(
2025-05-07T20:33:08.4833925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.4834022Z     kernel = self.compile(
2025-05-07T20:33:08.4834410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.4834596Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.4834735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4834947Z self = 
2025-05-07T20:33:08.4835722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.4836243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:33:08.4836985Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4837186Z context = 
2025-05-07T20:33:08.4837359Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4837630Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4837742Z                            module_map=module_map)
2025-05-07T20:33:08.4837907Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4838062Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.4838145Z E       ^
2025-05-07T20:33:08.4838500Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4838931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
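Every Hypothesis example in this run fails with the same root cause: Triton maps torch.float8_e4m3fn to its fp8e4nv type, and Triton's NVIDIA backend only accepts fp8e4nv on GPUs with compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner's A10G reports capability 8.6. A minimal sketch of that diagnosis in plain PyTorch; the helper name supports_fp8e4nv is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels need compute capability >= (8, 9);
        # the A10G on this runner reports (8, 6), so every compile fails.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)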
2025-05-07T20:33:08.4839078Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4855418Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row (failed in ref_fn via triton_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); test source and traceback identical to the first example above.
2025-05-07T20:33:08.4856536Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4872476Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row: same ValueError, same traceback.
2025-05-07T20:33:08.4873589Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4895367Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row: same ValueError, same traceback.
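For context on what the failing reference path computes: triton_quantize_fp8_row performs row-wise max-abs quantization into torch.float8_e4m3fn (the dtype Triton calls fp8e4nv). A rough eager-mode equivalent, assuming per-row scale row_max / fp8_max with an optional cap on the row max; details such as epsilon handling may differ from the actual kernel, and quantize_fp8_row_eager is our name:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp_min(eps) / fp8_max        # per-row dequant multiplier
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test dequantizes, y_fp8.to(torch.float32) * y_scale[:, None], since y ~ (y / scale) * scale.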
2025-05-07T20:33:08.4896507Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4896813Z     T=1,
2025-05-07T20:33:08.4896902Z     D=5120,
2025-05-07T20:33:08.4896994Z     scale_ub=1200.0,
2025-05-07T20:33:08.4897084Z     contiguous=True,
2025-05-07T20:33:08.4897177Z     compiled=True,
2025-05-07T20:33:08.4897257Z )
2025-05-07T20:33:08.4897692Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True (test source identical to the first example above)
2025-05-07T20:33:08.4902486Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:08.4902642Z moe/activation_test.py:117: 
2025-05-07T20:33:08.4902776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4902891Z moe/activation_test.py:115: in fn
2025-05-07T20:33:08.4902995Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.4903362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:08.4903465Z     return fn(*args, **kwargs)
2025-05-07T20:33:08.4904005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:08.4904114Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:08.4904476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.4904708Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.4905058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.4905157Z     kernel = self.compile(
2025-05-07T20:33:08.4905541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.4905763Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.4905893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4906120Z self = 
2025-05-07T20:33:08.4906907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.4907425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:33:08.4908222Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4908419Z context = 
2025-05-07T20:33:08.4908599Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4908871Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4908993Z                            module_map=module_map)
2025-05-07T20:33:08.4909161Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4909265Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.4909354Z E       ^
2025-05-07T20:33:08.4909717Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4910136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
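This example is the first to fail inside fn() itself: the fused _fbgemm_silu_mul_quant kernel launched from fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 hits the identical compile error, so both the op under test and the reference path depend on fp8e4nv. What the fused op computes, sketched in eager PyTorch on top of the quantize_fp8_row_eager helper above (our naming, not the library's):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then row-wise fp8 quantization,
        # mirroring ref_fn in the test source above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        return quantize_fp8_row_eager(y, scale_ub)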
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4909725Z 2025-05-07T20:33:08.4910136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4910152Z 2025-05-07T20:33:08.4910262Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4910490Z self=, 2025-05-07T20:33:08.4910578Z T=1, 2025-05-07T20:33:08.4910658Z D=5120, 2025-05-07T20:33:08.4910742Z scale_ub=None, 2025-05-07T20:33:08.4910840Z contiguous=False, 2025-05-07T20:33:08.4910925Z compiled=True, 2025-05-07T20:33:08.4911001Z ) 2025-05-07T20:33:08.4911230Z self = 2025-05-07T20:33:08.4911399Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4911404Z 2025-05-07T20:33:08.4911485Z @given( 2025-05-07T20:33:08.4911657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4911759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4911883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4912003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4912123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4912207Z ) 2025-05-07T20:33:08.4912454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4912611Z def test_silu_mul_quant( 2025-05-07T20:33:08.4912698Z self, 2025-05-07T20:33:08.4912779Z T: int, 2025-05-07T20:33:08.4912860Z D: int, 2025-05-07T20:33:08.4912973Z scale_ub: Optional[float], 2025-05-07T20:33:08.4913068Z contiguous: bool, 2025-05-07T20:33:08.4913166Z compiled: bool, 2025-05-07T20:33:08.4913249Z ) -> None: 2025-05-07T20:33:08.4913349Z torch.manual_seed(2025) 2025-05-07T20:33:08.4913434Z 2025-05-07T20:33:08.4913615Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4913693Z 2025-05-07T20:33:08.4913796Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4913927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4914019Z x = x_sign * x_clamp 2025-05-07T20:33:08.4914156Z x0 = x[:, :D] 2025-05-07T20:33:08.4914243Z x1 = x[:, D:] 2025-05-07T20:33:08.4914320Z 2025-05-07T20:33:08.4914419Z if contiguous: 2025-05-07T20:33:08.4914515Z x0 = x0.contiguous() 2025-05-07T20:33:08.4914610Z x1 = x1.contiguous() 2025-05-07T20:33:08.4914695Z 2025-05-07T20:33:08.4914792Z if scale_ub is not None: 2025-05-07T20:33:08.4914910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4915052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4915132Z ) 2025-05-07T20:33:08.4915226Z else: 2025-05-07T20:33:08.4915375Z scale_ub_tensor = None 2025-05-07T20:33:08.4915451Z 2025-05-07T20:33:08.4915594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4915689Z op = silu_mul_quant 2025-05-07T20:33:08.4915779Z if compiled: 2025-05-07T20:33:08.4915897Z op = torch.compile(op) 2025-05-07T20:33:08.4916010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4916087Z 2025-05-07T20:33:08.4916193Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4916322Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4916408Z 2025-05-07T20:33:08.4916547Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4916652Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4916754Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4916890Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4917032Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4917114Z 2025-05-07T20:33:08.4917226Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4917231Z 2025-05-07T20:33:08.4917331Z moe/activation_test.py:126: 2025-05-07T20:33:08.4917473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4917586Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4917728Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4918297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4918403Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4918768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4919003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4919412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4919692Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4920093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4920351Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4920770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4920939Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4921285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4921366Z fn() 2025-05-07T20:33:08.4921760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4921861Z self.fn.run( 2025-05-07T20:33:08.4922194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4922290Z kernel = self.compile( 2025-05-07T20:33:08.4922714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4922899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4923039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4923043Z 2025-05-07T20:33:08.4923253Z self = 2025-05-07T20:33:08.4924039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4924600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f158f6ace50>} 2025-05-07T20:33:08.4925359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4925561Z context = 2025-05-07T20:33:08.4925569Z 2025-05-07T20:33:08.4925739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4926011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4926124Z module_map=module_map) 2025-05-07T20:33:08.4926287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4926399Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4926484Z E ^ 2025-05-07T20:33:08.4926843Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4926848Z 2025-05-07T20:33:08.4927274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4927278Z 2025-05-07T20:33:08.4927384Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4927613Z self=, 2025-05-07T20:33:08.4927694Z T=1, 2025-05-07T20:33:08.4927773Z D=5120, 2025-05-07T20:33:08.4927863Z scale_ub=None, 2025-05-07T20:33:08.4927950Z contiguous=True, 2025-05-07T20:33:08.4928036Z compiled=False, 2025-05-07T20:33:08.4928116Z ) 2025-05-07T20:33:08.4928334Z self = 2025-05-07T20:33:08.4928500Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.4928546Z 2025-05-07T20:33:08.4928636Z @given( 2025-05-07T20:33:08.4928757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4928866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4928983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4929106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4929228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4929343Z ) 2025-05-07T20:33:08.4929593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4929695Z def test_silu_mul_quant( 2025-05-07T20:33:08.4929774Z self, 2025-05-07T20:33:08.4929853Z T: int, 2025-05-07T20:33:08.4929939Z D: int, 2025-05-07T20:33:08.4930039Z scale_ub: Optional[float], 2025-05-07T20:33:08.4930131Z contiguous: bool, 2025-05-07T20:33:08.4930227Z compiled: bool, 2025-05-07T20:33:08.4930308Z ) -> None: 2025-05-07T20:33:08.4930418Z torch.manual_seed(2025) 2025-05-07T20:33:08.4930495Z 2025-05-07T20:33:08.4930669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4930753Z 2025-05-07T20:33:08.4930847Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4931037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4931136Z x = x_sign * x_clamp 2025-05-07T20:33:08.4931218Z x0 = x[:, :D] 2025-05-07T20:33:08.4931303Z x1 = x[:, D:] 2025-05-07T20:33:08.4931384Z 2025-05-07T20:33:08.4931470Z if contiguous: 2025-05-07T20:33:08.4931565Z x0 = x0.contiguous() 2025-05-07T20:33:08.4931663Z x1 = x1.contiguous() 2025-05-07T20:33:08.4931737Z 2025-05-07T20:33:08.4931836Z if scale_ub is not None: 2025-05-07T20:33:08.4931943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4932083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4932240Z ) 2025-05-07T20:33:08.4932321Z else: 2025-05-07T20:33:08.4932417Z scale_ub_tensor = None 2025-05-07T20:33:08.4932499Z 2025-05-07T20:33:08.4932633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4932726Z op = silu_mul_quant 2025-05-07T20:33:08.4932822Z if compiled: 2025-05-07T20:33:08.4932926Z op 
= torch.compile(op) 2025-05-07T20:33:08.4933034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4933119Z 2025-05-07T20:33:08.4933212Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4933217Z 2025-05-07T20:33:08.4933326Z moe/activation_test.py:117: 2025-05-07T20:33:08.4933457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4933561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4933669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4934169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4934272Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4934635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4934863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4935215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4935313Z kernel = self.compile( 2025-05-07T20:33:08.4935692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4935879Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4936008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4936013Z 2025-05-07T20:33:08.4936268Z self = 2025-05-07T20:33:08.4937047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4937555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158fdb0790>} 2025-05-07T20:33:08.4938355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4938550Z context = 2025-05-07T20:33:08.4938555Z 2025-05-07T20:33:08.4938729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4939009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4939121Z module_map=module_map) 2025-05-07T20:33:08.4939293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4939393Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4939510Z E ^ 2025-05-07T20:33:08.4939877Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4939884Z 2025-05-07T20:33:08.4940642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4940652Z 2025-05-07T20:33:08.4940798Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4941074Z self=, 2025-05-07T20:33:08.4941155Z T=128, 2025-05-07T20:33:08.4941242Z D=5120, 2025-05-07T20:33:08.4941327Z scale_ub=None, 2025-05-07T20:33:08.4941621Z contiguous=False, 2025-05-07T20:33:08.4941708Z compiled=True, 2025-05-07T20:33:08.4941785Z ) 2025-05-07T20:33:08.4942017Z self = 2025-05-07T20:33:08.4942194Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4942203Z 2025-05-07T20:33:08.4942284Z @given( 2025-05-07T20:33:08.4942412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4942516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4942633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4942758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4942874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4942957Z ) 2025-05-07T20:33:08.4943207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4943303Z def test_silu_mul_quant( 2025-05-07T20:33:08.4943395Z self, 2025-05-07T20:33:08.4943474Z T: int, 2025-05-07T20:33:08.4943553Z D: int, 2025-05-07T20:33:08.4943661Z scale_ub: Optional[float], 2025-05-07T20:33:08.4943753Z contiguous: bool, 2025-05-07T20:33:08.4943842Z compiled: bool, 2025-05-07T20:33:08.4943933Z ) -> None: 2025-05-07T20:33:08.4944032Z torch.manual_seed(2025) 2025-05-07T20:33:08.4944108Z 2025-05-07T20:33:08.4944287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4944370Z 2025-05-07T20:33:08.4944466Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4944608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4944704Z x = x_sign * x_clamp 2025-05-07T20:33:08.4944788Z x0 = x[:, :D] 2025-05-07T20:33:08.4944877Z x1 = x[:, D:] 2025-05-07T20:33:08.4944952Z 2025-05-07T20:33:08.4945041Z if contiguous: 2025-05-07T20:33:08.4945143Z x0 = x0.contiguous() 2025-05-07T20:33:08.4945306Z x1 = x1.contiguous() 2025-05-07T20:33:08.4945393Z 2025-05-07T20:33:08.4945489Z if scale_ub is not None: 2025-05-07T20:33:08.4945598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4945747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4945827Z ) 2025-05-07T20:33:08.4945911Z else: 2025-05-07T20:33:08.4946018Z scale_ub_tensor = None 2025-05-07T20:33:08.4946151Z 2025-05-07T20:33:08.4946289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4946390Z op = silu_mul_quant 2025-05-07T20:33:08.4946479Z if compiled: 2025-05-07T20:33:08.4946582Z op = torch.compile(op) 2025-05-07T20:33:08.4946703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4946780Z 2025-05-07T20:33:08.4946887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4946891Z 2025-05-07T20:33:08.4946996Z moe/activation_test.py:117: 2025-05-07T20:33:08.4947140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4947251Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4947355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4947789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4947894Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4948389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4948499Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4948856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4949083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4949430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4949584Z kernel = self.compile( 2025-05-07T20:33:08.4949962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4950148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4950282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4950286Z 2025-05-07T20:33:08.4950506Z self = 2025-05-07T20:33:08.4951271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4951776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51040>} 2025-05-07T20:33:08.4952537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4952734Z context = 2025-05-07T20:33:08.4952739Z 2025-05-07T20:33:08.4952916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4953187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4953302Z module_map=module_map) 2025-05-07T20:33:08.4953480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4953582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4953673Z E ^ 2025-05-07T20:33:08.4954069Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4954077Z 2025-05-07T20:33:08.4954495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4954500Z 2025-05-07T20:33:08.4954614Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4954841Z self=, 2025-05-07T20:33:08.4954926Z T=128, 2025-05-07T20:33:08.4955006Z D=7168, 2025-05-07T20:33:08.4955131Z scale_ub=1200.0, 2025-05-07T20:33:08.4955233Z contiguous=False, 2025-05-07T20:33:08.4955322Z compiled=False, 2025-05-07T20:33:08.4955398Z ) 2025-05-07T20:33:08.4955624Z self = 2025-05-07T20:33:08.4955800Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.4955804Z 2025-05-07T20:33:08.4955885Z @given( 2025-05-07T20:33:08.4956013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4956123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4956250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4956372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4956488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4956570Z ) 2025-05-07T20:33:08.4956858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4956960Z def test_silu_mul_quant( 2025-05-07T20:33:08.4957046Z self, 2025-05-07T20:33:08.4957129Z T: int, 2025-05-07T20:33:08.4957211Z D: int, 2025-05-07T20:33:08.4957317Z scale_ub: Optional[float], 2025-05-07T20:33:08.4957410Z contiguous: bool, 2025-05-07T20:33:08.4957499Z compiled: bool, 2025-05-07T20:33:08.4957586Z ) -> None: 2025-05-07T20:33:08.4957688Z torch.manual_seed(2025) 2025-05-07T20:33:08.4957772Z 2025-05-07T20:33:08.4957952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4958071Z 2025-05-07T20:33:08.4958173Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4958302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4958395Z x = x_sign * x_clamp 2025-05-07T20:33:08.4958486Z x0 = x[:, :D] 2025-05-07T20:33:08.4958573Z x1 = x[:, D:] 2025-05-07T20:33:08.4958649Z 2025-05-07T20:33:08.4958746Z if contiguous: 2025-05-07T20:33:08.4958844Z x0 = x0.contiguous() 2025-05-07T20:33:08.4958939Z x1 = x1.contiguous() 2025-05-07T20:33:08.4959025Z 2025-05-07T20:33:08.4959117Z if scale_ub is not None: 2025-05-07T20:33:08.4959225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4959375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4959456Z ) 2025-05-07T20:33:08.4959545Z else: 2025-05-07T20:33:08.4959642Z scale_ub_tensor = None 2025-05-07T20:33:08.4959724Z 2025-05-07T20:33:08.4959866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4959963Z op = silu_mul_quant 2025-05-07T20:33:08.4960052Z if compiled: 2025-05-07T20:33:08.4960167Z op = torch.compile(op) 2025-05-07T20:33:08.4960279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4960356Z 2025-05-07T20:33:08.4960461Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4960465Z 2025-05-07T20:33:08.4960572Z moe/activation_test.py:117: 2025-05-07T20:33:08.4960713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4960817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4960921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4961423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4961523Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4962778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4963028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4963382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4963485Z kernel = self.compile( 2025-05-07T20:33:08.4963864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4964115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4964251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4964255Z 2025-05-07T20:33:08.4964465Z self = 2025-05-07T20:33:08.4965260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4965805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51ca0>} 2025-05-07T20:33:08.4966559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4966769Z context = 2025-05-07T20:33:08.4966774Z 2025-05-07T20:33:08.4966941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4967213Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4967329Z module_map=module_map) 2025-05-07T20:33:08.4967534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4967641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4967720Z E ^ 2025-05-07T20:33:08.4968081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4968094Z 2025-05-07T20:33:08.4968511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4968518Z 2025-05-07T20:33:08.4968622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4968850Z self=, 2025-05-07T20:33:08.4968932Z T=128, 2025-05-07T20:33:08.4969011Z D=5120, 2025-05-07T20:33:08.4969105Z scale_ub=None, 2025-05-07T20:33:08.4969194Z contiguous=False, 2025-05-07T20:33:08.4969281Z compiled=False, 2025-05-07T20:33:08.4969362Z ) 2025-05-07T20:33:08.4969584Z self = 2025-05-07T20:33:08.4969766Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4969770Z 2025-05-07T20:33:08.4969850Z @given( 2025-05-07T20:33:08.4969971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4970080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4970198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4970321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4970445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4970522Z ) 2025-05-07T20:33:08.4970778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4970873Z def test_silu_mul_quant( 2025-05-07T20:33:08.4970952Z self, 2025-05-07T20:33:08.4971036Z T: int, 2025-05-07T20:33:08.4971119Z D: int, 2025-05-07T20:33:08.4971268Z scale_ub: Optional[float], 2025-05-07T20:33:08.4971370Z contiguous: bool, 2025-05-07T20:33:08.4971459Z compiled: bool, 2025-05-07T20:33:08.4971540Z ) -> None: 2025-05-07T20:33:08.4971646Z torch.manual_seed(2025) 2025-05-07T20:33:08.4971721Z 2025-05-07T20:33:08.4971895Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4971980Z 2025-05-07T20:33:08.4972074Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4972241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4972340Z x = x_sign * x_clamp 2025-05-07T20:33:08.4972424Z x0 = x[:, :D] 2025-05-07T20:33:08.4972511Z x1 = x[:, D:] 2025-05-07T20:33:08.4972585Z 2025-05-07T20:33:08.4972671Z if contiguous: 2025-05-07T20:33:08.4972771Z x0 = x0.contiguous() 2025-05-07T20:33:08.4972863Z x1 = x1.contiguous() 2025-05-07T20:33:08.4972938Z 2025-05-07T20:33:08.4973036Z if scale_ub is not None: 2025-05-07T20:33:08.4973151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4973288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4973372Z ) 2025-05-07T20:33:08.4973452Z else: 2025-05-07T20:33:08.4973551Z scale_ub_tensor = None 2025-05-07T20:33:08.4973631Z 2025-05-07T20:33:08.4973804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4973905Z op = silu_mul_quant 2025-05-07T20:33:08.4973997Z if compiled: 2025-05-07T20:33:08.4974100Z op = torch.compile(op) 2025-05-07T20:33:08.4974212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4974287Z 2025-05-07T20:33:08.4974380Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4974385Z 2025-05-07T20:33:08.4974491Z moe/activation_test.py:117: 2025-05-07T20:33:08.4974623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4974770Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4974880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4975378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4975485Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4975846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4976079Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4976428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4976524Z kernel = self.compile( 2025-05-07T20:33:08.4976903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4977086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4977221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4977226Z 2025-05-07T20:33:08.4977439Z self = 2025-05-07T20:33:08.4978210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4978765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f3fd310>} 2025-05-07T20:33:08.4979530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4979767Z context = 2025-05-07T20:33:08.4979775Z 2025-05-07T20:33:08.4979954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4980223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4980344Z module_map=module_map) 2025-05-07T20:33:08.4980510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4980613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4980740Z E ^ 2025-05-07T20:33:08.4981206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4981621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three Hypothesis examples print the same test source and fail with the identical traceback and CompilationError as above (silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile -> make_ir); only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError, as above
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError, as above (entered through torch/_dynamo/eval_frame.py:678 since compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, as above
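Why every variant fails the same way: Triton's fp8e4nv is the NVIDIA e4m3 float8 format (torch.float8_e4m3fn on the PyTorch side), and the error lists only ('fp8e4b15', 'fp8e5') as available, which is what Triton reports on GPUs older than the Ada/Hopper generation. The following minimal sketch reproduces the same CompilationError on such a part; it is an illustration under that assumption, not FBGEMM's actual kernel, and the kernel name here is hypothetical:

    # Hypothetical repro sketch: cast to fp8e4nv inside a Triton kernel on a
    # pre-Ada CUDA GPU and the compile fails exactly as in the log above.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs, mask=offs < n)
        # This .to() is what make_ir / ast_to_ttir rejects on older parts:
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=offs < n)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(8,)](x, y, 1024, BLOCK=128)  # CompilationError on older GPUs

The cast is rejected while lowering the AST to TTIR, which is why the error points at the kernel's def line ("at 1:0: def _fbgemm_silu_mul_quant( ^") rather than at the offending cast itself.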
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
Same test source as above, but this example fails later, in the reference path rather than in silu_mul_quant: fn() returns, and the exception is raised inside ref_fn, which recomputes y = x0 * sigmoid(x0) * x1 in fp32 and quantizes it row-wise:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
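The reference path dies the same way because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that casts to fp8e4nv; the autotuner frames appear because each candidate config is compiled and benchmarked before timing. For orientation, here is a plain-PyTorch sketch of the per-row fp8 quantization the reference asks for, with the optional scale upper bound; the semantics are assumed, inferred only from the test's dequantization y = y_fp8.float() * y_scale[:, None], not taken from FBGEMM's source:

    # Hedged sketch of row-wise FP8 quantization (assumed semantics).
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # y is recovered (approximately) as y_fp8.float() * scale[:, None].
        return y_fp8, scale.squeeze(-1)

On a GPU where the fp8e4nv cast compiles, each row of y_fp8 then spans the representable e4m3 range, and y_scale is the per-row dequantization factor the test multiplies back in.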
The remaining Hypothesis examples again print the same test source and fail at fn() in silu_mul_quant with the identical traceback and CompilationError as the first example:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
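Every parameter combination fails for the same environmental reason rather than a logic bug: the runner's GPU predates fp8e4nv support, so Triton only offers fp8e4b15 and fp8e5 there. A common way to keep such a suite green on older parts is to gate FP8 tests on device capability. A hedged sketch follows; the (8, 9) threshold for fp8e4nv availability is an assumption to verify against Triton's documentation, and the class name is hypothetical, not FBGEMM's:

    import unittest
    import torch

    def _gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs compute capability
        # >= (8, 9); older GPUs raise the supported-dtypes error seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8ActivationTest(unittest.TestCase):  # hypothetical name
        ...

Gating at collection time also avoids paying the autotuner's compile-and-bench loop for every Hypothesis example before each one fails.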
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError, same traceback

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
Same test source as above; the call fails at fn() again:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
2025-05-07T20:33:08.5133481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5133580Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5133938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5134176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5134509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5134614Z kernel = self.compile( 2025-05-07T20:33:08.5134991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5135174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5135306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5135310Z 2025-05-07T20:33:08.5135516Z self = 2025-05-07T20:33:08.5136338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5136862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158eb0dee0>} 2025-05-07T20:33:08.5137616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5137854Z context = 2025-05-07T20:33:08.5137859Z 2025-05-07T20:33:08.5138025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5138295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5138403Z module_map=module_map) 2025-05-07T20:33:08.5144076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5144213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5144311Z E ^ 2025-05-07T20:33:08.5144680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5144685Z 2025-05-07T20:33:08.5145313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5145318Z 2025-05-07T20:33:08.5145441Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5145668Z self=, 2025-05-07T20:33:08.5145750Z T=4096, 2025-05-07T20:33:08.5145843Z D=5120, 2025-05-07T20:33:08.5145930Z scale_ub=None, 2025-05-07T20:33:08.5146027Z contiguous=False, 2025-05-07T20:33:08.5146114Z compiled=True, 2025-05-07T20:33:08.5146192Z ) 2025-05-07T20:33:08.5146420Z self = 2025-05-07T20:33:08.5146681Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5146686Z 2025-05-07T20:33:08.5146768Z @given( 2025-05-07T20:33:08.5146901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5147006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5147129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5147258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5147381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5147469Z ) 2025-05-07T20:33:08.5147719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5147818Z def test_silu_mul_quant( 2025-05-07T20:33:08.5147908Z self, 2025-05-07T20:33:08.5147992Z T: int, 2025-05-07T20:33:08.5148074Z D: int, 2025-05-07T20:33:08.5148182Z scale_ub: Optional[float], 2025-05-07T20:33:08.5148275Z contiguous: bool, 2025-05-07T20:33:08.5148369Z compiled: bool, 2025-05-07T20:33:08.5148460Z ) -> None: 2025-05-07T20:33:08.5148559Z torch.manual_seed(2025) 2025-05-07T20:33:08.5148640Z 2025-05-07T20:33:08.5148821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5148900Z 2025-05-07T20:33:08.5149010Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5149139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5149236Z x = x_sign * x_clamp 2025-05-07T20:33:08.5149329Z x0 = x[:, :D] 2025-05-07T20:33:08.5149415Z x1 = x[:, D:] 2025-05-07T20:33:08.5149491Z 2025-05-07T20:33:08.5149587Z if contiguous: 2025-05-07T20:33:08.5149682Z x0 = x0.contiguous() 2025-05-07T20:33:08.5149774Z x1 = x1.contiguous() 2025-05-07T20:33:08.5149862Z 2025-05-07T20:33:08.5149957Z if scale_ub is not None: 2025-05-07T20:33:08.5150066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5150289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5150375Z ) 2025-05-07T20:33:08.5150468Z else: 2025-05-07T20:33:08.5150565Z scale_ub_tensor = None 2025-05-07T20:33:08.5150643Z 2025-05-07T20:33:08.5150785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5150881Z op = silu_mul_quant 2025-05-07T20:33:08.5150969Z if compiled: 2025-05-07T20:33:08.5151079Z op = torch.compile(op) 2025-05-07T20:33:08.5151251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5151329Z 2025-05-07T20:33:08.5151431Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5151435Z 2025-05-07T20:33:08.5151536Z moe/activation_test.py:117: 2025-05-07T20:33:08.5151668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5151783Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5151885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5152269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5152365Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5152900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5153014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5153370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5153605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5153948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5154049Z kernel = self.compile( 2025-05-07T20:33:08.5154432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5154685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5154815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5154819Z 2025-05-07T20:33:08.5155032Z self = 2025-05-07T20:33:08.5155808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5156325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f2b9940>} 2025-05-07T20:33:08.5157079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5157283Z context = 2025-05-07T20:33:08.5157288Z 2025-05-07T20:33:08.5157460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5157724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5157841Z module_map=module_map) 2025-05-07T20:33:08.5158008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5158108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5158195Z E ^ 2025-05-07T20:33:08.5158552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5158557Z 2025-05-07T20:33:08.5158981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5158988Z 2025-05-07T20:33:08.5159135Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5159360Z self=, 2025-05-07T20:33:08.5159448Z T=4096, 2025-05-07T20:33:08.5159528Z D=5120, 2025-05-07T20:33:08.5159612Z scale_ub=1200.0, 2025-05-07T20:33:08.5159708Z contiguous=False, 2025-05-07T20:33:08.5159799Z compiled=False, 2025-05-07T20:33:08.5159883Z ) 2025-05-07T20:33:08.5160102Z self = 2025-05-07T20:33:08.5160320Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5160324Z 2025-05-07T20:33:08.5160416Z @given( 2025-05-07T20:33:08.5160538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5160641Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5160768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5160888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5161015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5161101Z ) 2025-05-07T20:33:08.5161348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5161452Z def test_silu_mul_quant( 2025-05-07T20:33:08.5161533Z self, 2025-05-07T20:33:08.5161652Z T: int, 2025-05-07T20:33:08.5161742Z D: int, 2025-05-07T20:33:08.5161845Z scale_ub: Optional[float], 2025-05-07T20:33:08.5161940Z contiguous: bool, 2025-05-07T20:33:08.5162036Z compiled: bool, 2025-05-07T20:33:08.5162118Z ) -> None: 2025-05-07T20:33:08.5162217Z torch.manual_seed(2025) 2025-05-07T20:33:08.5162302Z 2025-05-07T20:33:08.5162474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5162551Z 2025-05-07T20:33:08.5162653Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5162785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5162930Z x = x_sign * x_clamp 2025-05-07T20:33:08.5163014Z x0 = x[:, :D] 2025-05-07T20:33:08.5163099Z x1 = x[:, D:] 2025-05-07T20:33:08.5163182Z 2025-05-07T20:33:08.5163270Z if contiguous: 2025-05-07T20:33:08.5163368Z x0 = x0.contiguous() 2025-05-07T20:33:08.5163468Z x1 = x1.contiguous() 2025-05-07T20:33:08.5163550Z 2025-05-07T20:33:08.5163648Z if scale_ub is not None: 2025-05-07T20:33:08.5163765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5163905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5163986Z ) 2025-05-07T20:33:08.5164075Z else: 2025-05-07T20:33:08.5164171Z scale_ub_tensor = None 2025-05-07T20:33:08.5164246Z 2025-05-07T20:33:08.5164390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5164485Z op = silu_mul_quant 2025-05-07T20:33:08.5164582Z if compiled: 2025-05-07T20:33:08.5164692Z op = torch.compile(op) 2025-05-07T20:33:08.5164799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5164882Z 2025-05-07T20:33:08.5164976Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5164981Z 2025-05-07T20:33:08.5165082Z moe/activation_test.py:117: 2025-05-07T20:33:08.5165223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5165328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5165436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5165949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5166049Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5166423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5166648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5167037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5167143Z kernel = self.compile( 2025-05-07T20:33:08.5167527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5167720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5167849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5167889Z 2025-05-07T20:33:08.5168103Z self = 2025-05-07T20:33:08.5168935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5169445Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec343a0>} 2025-05-07T20:33:08.5170245Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5170441Z context = 2025-05-07T20:33:08.5170448Z 2025-05-07T20:33:08.5170619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5170897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5171009Z module_map=module_map) 2025-05-07T20:33:08.5171181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5171283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5171365Z E ^ 2025-05-07T20:33:08.5171774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5171779Z 2025-05-07T20:33:08.5172197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5172202Z 2025-05-07T20:33:08.5172317Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5172542Z self=, 2025-05-07T20:33:08.5172625Z T=4096, 2025-05-07T20:33:08.5172710Z D=5120, 2025-05-07T20:33:08.5172795Z scale_ub=1200.0, 2025-05-07T20:33:08.5172884Z contiguous=False, 2025-05-07T20:33:08.5172979Z compiled=True, 2025-05-07T20:33:08.5173054Z ) 2025-05-07T20:33:08.5173272Z self = 2025-05-07T20:33:08.5173453Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5173458Z 2025-05-07T20:33:08.5173545Z @given( 2025-05-07T20:33:08.5173674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5173778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5173900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5174029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5174145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5174222Z ) 2025-05-07T20:33:08.5174476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5174575Z def test_silu_mul_quant( 2025-05-07T20:33:08.5174655Z self, 2025-05-07T20:33:08.5174744Z T: int, 2025-05-07T20:33:08.5174828Z D: int, 2025-05-07T20:33:08.5174932Z scale_ub: Optional[float], 2025-05-07T20:33:08.5175032Z contiguous: bool, 2025-05-07T20:33:08.5175121Z compiled: bool, 2025-05-07T20:33:08.5175212Z ) -> None: 2025-05-07T20:33:08.5175355Z torch.manual_seed(2025) 2025-05-07T20:33:08.5175436Z 2025-05-07T20:33:08.5175619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5175699Z 2025-05-07T20:33:08.5175794Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5175932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5176028Z x = x_sign * x_clamp 2025-05-07T20:33:08.5176113Z x0 = x[:, :D] 2025-05-07T20:33:08.5176204Z x1 = x[:, D:] 2025-05-07T20:33:08.5176320Z 2025-05-07T20:33:08.5176409Z if contiguous: 2025-05-07T20:33:08.5176512Z x0 = x0.contiguous() 2025-05-07T20:33:08.5176607Z x1 = x1.contiguous() 2025-05-07T20:33:08.5176690Z 2025-05-07T20:33:08.5176783Z if scale_ub is not None: 2025-05-07T20:33:08.5176893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5177042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5177123Z ) 2025-05-07T20:33:08.5177211Z else: 2025-05-07T20:33:08.5177320Z scale_ub_tensor = None 2025-05-07T20:33:08.5177396Z 2025-05-07T20:33:08.5177535Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5177639Z op = silu_mul_quant 2025-05-07T20:33:08.5177729Z if compiled: 2025-05-07T20:33:08.5177871Z op = torch.compile(op) 2025-05-07T20:33:08.5177990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5178070Z 2025-05-07T20:33:08.5178172Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5178176Z 2025-05-07T20:33:08.5178283Z moe/activation_test.py:117: 2025-05-07T20:33:08.5178414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5178517Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5178633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5179003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5179143Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5179653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5179752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5180118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5180342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5180681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5180784Z kernel = self.compile( 2025-05-07T20:33:08.5181307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5181494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5181628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5181635Z 2025-05-07T20:33:08.5181845Z self = 2025-05-07T20:33:08.5182623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5183137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec34280>} 2025-05-07T20:33:08.5183886Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5184079Z context = 2025-05-07T20:33:08.5184132Z 2025-05-07T20:33:08.5184304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5184579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5184690Z module_map=module_map) 2025-05-07T20:33:08.5184865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5184965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5185085Z E ^ 2025-05-07T20:33:08.5185454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5185459Z 2025-05-07T20:33:08.5185876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5185881Z 2025-05-07T20:33:08.5185993Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5186221Z self=, 2025-05-07T20:33:08.5186303Z T=2048, 2025-05-07T20:33:08.5186388Z D=7168, 2025-05-07T20:33:08.5186475Z scale_ub=1200.0, 2025-05-07T20:33:08.5186564Z contiguous=False, 2025-05-07T20:33:08.5186661Z compiled=False, 2025-05-07T20:33:08.5186736Z ) 2025-05-07T20:33:08.5187028Z self = 2025-05-07T20:33:08.5187219Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5187226Z 2025-05-07T20:33:08.5187308Z @given( 2025-05-07T20:33:08.5187434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5187536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5187653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5187778Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5187893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5187970Z ) 2025-05-07T20:33:08.5188266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5188364Z def test_silu_mul_quant( 2025-05-07T20:33:08.5188449Z self, 2025-05-07T20:33:08.5188535Z T: int, 2025-05-07T20:33:08.5188614Z D: int, 2025-05-07T20:33:08.5188715Z scale_ub: Optional[float], 2025-05-07T20:33:08.5188815Z contiguous: bool, 2025-05-07T20:33:08.5188903Z compiled: bool, 2025-05-07T20:33:08.5188991Z ) -> None: 2025-05-07T20:33:08.5189091Z torch.manual_seed(2025) 2025-05-07T20:33:08.5189165Z 2025-05-07T20:33:08.5189344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5189420Z 2025-05-07T20:33:08.5189514Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5189647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5189739Z x = x_sign * x_clamp 2025-05-07T20:33:08.5189822Z x0 = x[:, :D] 2025-05-07T20:33:08.5189912Z x1 = x[:, D:] 2025-05-07T20:33:08.5189991Z 2025-05-07T20:33:08.5190079Z if contiguous: 2025-05-07T20:33:08.5190179Z x0 = x0.contiguous() 2025-05-07T20:33:08.5190272Z x1 = x1.contiguous() 2025-05-07T20:33:08.5190356Z 2025-05-07T20:33:08.5190449Z if scale_ub is not None: 2025-05-07T20:33:08.5190560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5190707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5190788Z ) 2025-05-07T20:33:08.5190870Z else: 2025-05-07T20:33:08.5190974Z scale_ub_tensor = None 2025-05-07T20:33:08.5191048Z 2025-05-07T20:33:08.5191182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5191281Z op = silu_mul_quant 2025-05-07T20:33:08.5191369Z if compiled: 2025-05-07T20:33:08.5191470Z op = torch.compile(op) 2025-05-07T20:33:08.5191585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5191710Z 2025-05-07T20:33:08.5191815Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5191820Z 2025-05-07T20:33:08.5191919Z moe/activation_test.py:117: 2025-05-07T20:33:08.5192049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5192161Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5192266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5192767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5192937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5193294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5193525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5193863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5193965Z kernel = self.compile( 2025-05-07T20:33:08.5194348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5194526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5194691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5194704Z 2025-05-07T20:33:08.5194914Z self = 2025-05-07T20:33:08.5195687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5196206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ed85670>} 2025-05-07T20:33:08.5197003Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5197205Z context = 2025-05-07T20:33:08.5197211Z 2025-05-07T20:33:08.5197378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5197646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5197765Z module_map=module_map) 2025-05-07T20:33:08.5197929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5198029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5198117Z E ^ 2025-05-07T20:33:08.5198476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5198486Z 2025-05-07T20:33:08.5198955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5198960Z 2025-05-07T20:33:08.5199064Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5199290Z self=, 2025-05-07T20:33:08.5199378Z T=1, 2025-05-07T20:33:08.5199457Z D=7168, 2025-05-07T20:33:08.5199547Z scale_ub=None, 2025-05-07T20:33:08.5199640Z contiguous=True, 2025-05-07T20:33:08.5199727Z compiled=False, 2025-05-07T20:33:08.5199807Z ) 2025-05-07T20:33:08.5200025Z self = 2025-05-07T20:33:08.5200193Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5200198Z 2025-05-07T20:33:08.5200284Z @given( 2025-05-07T20:33:08.5200406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5200553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5200679Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5200801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5200924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5201002Z ) 2025-05-07T20:33:08.5201251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5201352Z def test_silu_mul_quant( 2025-05-07T20:33:08.5201468Z self, 2025-05-07T20:33:08.5201549Z T: int, 2025-05-07T20:33:08.5201634Z D: int, 2025-05-07T20:33:08.5201737Z scale_ub: Optional[float], 2025-05-07T20:33:08.5201831Z contiguous: bool, 2025-05-07T20:33:08.5201925Z compiled: bool, 2025-05-07T20:33:08.5202006Z ) -> None: 2025-05-07T20:33:08.5202105Z torch.manual_seed(2025) 2025-05-07T20:33:08.5202186Z 2025-05-07T20:33:08.5202362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5202446Z 2025-05-07T20:33:08.5202546Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5202673Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5202772Z x = x_sign * x_clamp 2025-05-07T20:33:08.5202855Z x0 = x[:, :D] 2025-05-07T20:33:08.5202937Z x1 = x[:, D:] 2025-05-07T20:33:08.5203056Z 2025-05-07T20:33:08.5203144Z if contiguous: 2025-05-07T20:33:08.5203238Z x0 = x0.contiguous() 2025-05-07T20:33:08.5203337Z x1 = x1.contiguous() 2025-05-07T20:33:08.5203412Z 2025-05-07T20:33:08.5203504Z if scale_ub is not None: 2025-05-07T20:33:08.5203618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5203757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5203835Z ) 2025-05-07T20:33:08.5203934Z else: 2025-05-07T20:33:08.5204031Z scale_ub_tensor = None 2025-05-07T20:33:08.5204106Z 2025-05-07T20:33:08.5204293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5204388Z op = silu_mul_quant 2025-05-07T20:33:08.5204480Z if compiled: 2025-05-07T20:33:08.5204589Z op = torch.compile(op) 2025-05-07T20:33:08.5204699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5204775Z 2025-05-07T20:33:08.5204881Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5204885Z 2025-05-07T20:33:08.5204986Z moe/activation_test.py:117: 2025-05-07T20:33:08.5205130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5205237Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5205341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5205842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5205944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5206302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5206537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5206874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5206977Z kernel = self.compile( 2025-05-07T20:33:08.5207354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5207535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5207672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5207676Z 2025-05-07T20:33:08.5207882Z self = 2025-05-07T20:33:08.5208705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5209221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5f280>} 2025-05-07T20:33:08.5209968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5210208Z context = 2025-05-07T20:33:08.5210213Z 2025-05-07T20:33:08.5210381Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5210652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5210762Z module_map=module_map) 2025-05-07T20:33:08.5210934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5211041Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5211120Z E ^ 2025-05-07T20:33:08.5211484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5211489Z 2025-05-07T20:33:08.5211935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5211943Z 2025-05-07T20:33:08.5212049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5212277Z self=, 2025-05-07T20:33:08.5212358Z T=16384, 2025-05-07T20:33:08.5212436Z D=7168, 2025-05-07T20:33:08.5212525Z scale_ub=1200.0, 2025-05-07T20:33:08.5212613Z contiguous=False, 2025-05-07T20:33:08.5212704Z compiled=True, 2025-05-07T20:33:08.5212779Z ) 2025-05-07T20:33:08.5213001Z self = 2025-05-07T20:33:08.5213234Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5213238Z 2025-05-07T20:33:08.5213318Z @given( 2025-05-07T20:33:08.5213438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5213553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5213670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5213788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5213912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5213988Z ) 2025-05-07T20:33:08.5214241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5214338Z def test_silu_mul_quant( 2025-05-07T20:33:08.5214416Z self, 2025-05-07T20:33:08.5214501Z T: int, 2025-05-07T20:33:08.5214580Z D: int, 2025-05-07T20:33:08.5214680Z scale_ub: Optional[float], 2025-05-07T20:33:08.5214784Z contiguous: bool, 2025-05-07T20:33:08.5214874Z compiled: bool, 2025-05-07T20:33:08.5214954Z ) -> None: 2025-05-07T20:33:08.5215058Z torch.manual_seed(2025) 2025-05-07T20:33:08.5215133Z 2025-05-07T20:33:08.5215304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5215388Z 2025-05-07T20:33:08.5215483Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5215615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5215710Z x = x_sign * x_clamp 2025-05-07T20:33:08.5215793Z x0 = x[:, :D] 2025-05-07T20:33:08.5215882Z x1 = x[:, D:] 2025-05-07T20:33:08.5215958Z 2025-05-07T20:33:08.5216044Z if contiguous: 2025-05-07T20:33:08.5216143Z x0 = x0.contiguous() 2025-05-07T20:33:08.5216234Z x1 = x1.contiguous() 2025-05-07T20:33:08.5216308Z 2025-05-07T20:33:08.5216410Z if scale_ub is not None: 2025-05-07T20:33:08.5216569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5216711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5216796Z ) 2025-05-07T20:33:08.5216876Z else: 2025-05-07T20:33:08.5216978Z scale_ub_tensor = None 2025-05-07T20:33:08.5217052Z 2025-05-07T20:33:08.5217190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5217288Z op = silu_mul_quant 2025-05-07T20:33:08.5217448Z if compiled: 2025-05-07T20:33:08.5217552Z op = torch.compile(op) 2025-05-07T20:33:08.5217668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5217743Z 2025-05-07T20:33:08.5217836Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5217841Z 2025-05-07T20:33:08.5217948Z moe/activation_test.py:117: 2025-05-07T20:33:08.5218081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5218183Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5218302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5218675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5218778Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5219307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5219413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5219781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5220007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5220352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5220452Z kernel = self.compile( 2025-05-07T20:33:08.5220832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5221180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5221311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5221316Z 2025-05-07T20:33:08.5221524Z self = 2025-05-07T20:33:08.5222317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5222824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5fee0>} 2025-05-07T20:33:08.5223572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5223765Z context = 2025-05-07T20:33:08.5223770Z 2025-05-07T20:33:08.5223944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5224211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5224322Z module_map=module_map) 2025-05-07T20:33:08.5224495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5224595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5224675Z E ^ 2025-05-07T20:33:08.5225038Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5225043Z 2025-05-07T20:33:08.5225498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5225506Z 2025-05-07T20:33:08.5225623Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5225847Z self=, 2025-05-07T20:33:08.5225926Z T=1, 2025-05-07T20:33:08.5226011Z D=7168, 2025-05-07T20:33:08.5226098Z scale_ub=None, 2025-05-07T20:33:08.5226189Z contiguous=False, 2025-05-07T20:33:08.5226287Z compiled=False, 2025-05-07T20:33:08.5226363Z ) 2025-05-07T20:33:08.5226628Z self = 2025-05-07T20:33:08.5226799Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.5226804Z 2025-05-07T20:33:08.5226884Z @given( 2025-05-07T20:33:08.5227009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5227113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5227229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5227358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5227479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5227950Z ) 2025-05-07T20:33:08.5228197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5228294Z def test_silu_mul_quant( 2025-05-07T20:33:08.5228423Z self, 2025-05-07T20:33:08.5228503Z T: int, 2025-05-07T20:33:08.5228582Z D: int, 2025-05-07T20:33:08.5228692Z scale_ub: Optional[float], 2025-05-07T20:33:08.5228785Z contiguous: bool, 2025-05-07T20:33:08.5228873Z compiled: bool, 2025-05-07T20:33:08.5228959Z ) -> None: 2025-05-07T20:33:08.5229055Z torch.manual_seed(2025) 2025-05-07T20:33:08.5229131Z 2025-05-07T20:33:08.5229312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5229388Z 2025-05-07T20:33:08.5229489Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5229618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5229755Z x = x_sign * x_clamp 2025-05-07T20:33:08.5229846Z x0 = x[:, :D] 2025-05-07T20:33:08.5229928Z x1 = x[:, D:] 2025-05-07T20:33:08.5230002Z 2025-05-07T20:33:08.5230095Z if contiguous: 2025-05-07T20:33:08.5230188Z x0 = x0.contiguous() 2025-05-07T20:33:08.5230282Z x1 = x1.contiguous() 2025-05-07T20:33:08.5230365Z 2025-05-07T20:33:08.5230457Z if scale_ub is not None: 2025-05-07T20:33:08.5230566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5230710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5230788Z ) 2025-05-07T20:33:08.5230867Z else: 2025-05-07T20:33:08.5230968Z scale_ub_tensor = None 2025-05-07T20:33:08.5231042Z 2025-05-07T20:33:08.5231180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5231278Z op = silu_mul_quant 2025-05-07T20:33:08.5231370Z if compiled: 2025-05-07T20:33:08.5231477Z op = torch.compile(op) 2025-05-07T20:33:08.5231587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5231660Z 2025-05-07T20:33:08.5231759Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5231764Z 2025-05-07T20:33:08.5231862Z moe/activation_test.py:117: 2025-05-07T20:33:08.5231993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5232104Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5232207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5232715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5232813Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5233169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5233445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5233796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5233892Z kernel = self.compile( 2025-05-07T20:33:08.5234279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5234460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5234631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5234635Z 2025-05-07T20:33:08.5234846Z self = 2025-05-07T20:33:08.5235617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5236133Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ecb6670>} 2025-05-07T20:33:08.5236909Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5237116Z context = 2025-05-07T20:33:08.5237122Z 2025-05-07T20:33:08.5237289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5237560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5237671Z module_map=module_map) 2025-05-07T20:33:08.5237834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5237942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5238065Z E ^ 2025-05-07T20:33:08.5238420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5238424Z 2025-05-07T20:33:08.5238841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5238848Z 2025-05-07T20:33:08.5238952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5239181Z self=, 2025-05-07T20:33:08.5239262Z T=2048, 2025-05-07T20:33:08.5239341Z D=7168, 2025-05-07T20:33:08.5239431Z scale_ub=None, 2025-05-07T20:33:08.5239520Z contiguous=False, 2025-05-07T20:33:08.5239608Z compiled=True, 2025-05-07T20:33:08.5239690Z ) 2025-05-07T20:33:08.5239910Z self = 2025-05-07T20:33:08.5240363Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5240395Z 2025-05-07T20:33:08.5240513Z @given( 2025-05-07T20:33:08.5240677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5240826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5240959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5241083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5241205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5241285Z ) 2025-05-07T20:33:08.5241533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5241637Z def test_silu_mul_quant( 2025-05-07T20:33:08.5241717Z self, 2025-05-07T20:33:08.5241795Z T: int, 2025-05-07T20:33:08.5241879Z D: int, 2025-05-07T20:33:08.5241977Z scale_ub: Optional[float], 2025-05-07T20:33:08.5242075Z contiguous: bool, 2025-05-07T20:33:08.5242161Z compiled: bool, 2025-05-07T20:33:08.5242242Z ) -> None: 2025-05-07T20:33:08.5242513Z torch.manual_seed(2025) 2025-05-07T20:33:08.5242592Z 2025-05-07T20:33:08.5242766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5242847Z 2025-05-07T20:33:08.5242940Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5243068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5243163Z x = x_sign * x_clamp 2025-05-07T20:33:08.5243245Z x0 = x[:, :D] 2025-05-07T20:33:08.5243392Z x1 = x[:, D:] 2025-05-07T20:33:08.5243472Z 2025-05-07T20:33:08.5243557Z if contiguous: 2025-05-07T20:33:08.5243655Z x0 = x0.contiguous() 2025-05-07T20:33:08.5243745Z x1 = x1.contiguous() 2025-05-07T20:33:08.5243821Z 2025-05-07T20:33:08.5243921Z if scale_ub is not None: 2025-05-07T20:33:08.5244027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5244165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5244258Z ) 2025-05-07T20:33:08.5244338Z else: 2025-05-07T20:33:08.5244434Z scale_ub_tensor = None 2025-05-07T20:33:08.5244515Z 2025-05-07T20:33:08.5244647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5244740Z op = silu_mul_quant 2025-05-07T20:33:08.5244897Z if compiled: 2025-05-07T20:33:08.5245001Z op = torch.compile(op) 2025-05-07T20:33:08.5245117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5245198Z 2025-05-07T20:33:08.5245292Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5245296Z 2025-05-07T20:33:08.5245403Z moe/activation_test.py:117: 2025-05-07T20:33:08.5245532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5245634Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5245742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5246118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5246280Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5246778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5246878Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5247246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5247474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5247814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5247915Z kernel = self.compile( 2025-05-07T20:33:08.5248291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5248474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5248607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5248611Z 2025-05-07T20:33:08.5248818Z self = 2025-05-07T20:33:08.5249596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5250106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e8c1550>} 2025-05-07T20:33:08.5250851Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5251117Z context = 2025-05-07T20:33:08.5251125Z 2025-05-07T20:33:08.5251298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5251575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5251687Z module_map=module_map) 2025-05-07T20:33:08.5251857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5251957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5252080Z E ^ 2025-05-07T20:33:08.5252443Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5252447Z 2025-05-07T20:33:08.5252856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5252860Z 2025-05-07T20:33:08.5252972Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5253203Z self=, 2025-05-07T20:33:08.5253282Z T=4096, 2025-05-07T20:33:08.5253368Z D=7168, 2025-05-07T20:33:08.5253454Z scale_ub=None, 2025-05-07T20:33:08.5253543Z contiguous=False, 2025-05-07T20:33:08.5253634Z compiled=True, 2025-05-07T20:33:08.5253710Z ) 2025-05-07T20:33:08.5253967Z self = 2025-05-07T20:33:08.5254155Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5254162Z 2025-05-07T20:33:08.5254242Z @given( 2025-05-07T20:33:08.5254370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5254474Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5254591Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5254714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5254830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5254951Z ) 2025-05-07T20:33:08.5255205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5255301Z def test_silu_mul_quant( 2025-05-07T20:33:08.5255378Z self, 2025-05-07T20:33:08.5255465Z T: int, 2025-05-07T20:33:08.5255544Z D: int, 2025-05-07T20:33:08.5255649Z scale_ub: Optional[float], 2025-05-07T20:33:08.5255749Z contiguous: bool, 2025-05-07T20:33:08.5255838Z compiled: bool, 2025-05-07T20:33:08.5255928Z ) -> None: 2025-05-07T20:33:08.5256025Z torch.manual_seed(2025) 2025-05-07T20:33:08.5256100Z 2025-05-07T20:33:08.5256278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5256357Z 2025-05-07T20:33:08.5256452Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5256587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5256679Z x = x_sign * x_clamp 2025-05-07T20:33:08.5256764Z x0 = x[:, :D] 2025-05-07T20:33:08.5256858Z x1 = x[:, D:] 2025-05-07T20:33:08.5256933Z 2025-05-07T20:33:08.5257018Z if contiguous: 2025-05-07T20:33:08.5257119Z x0 = x0.contiguous() 2025-05-07T20:33:08.5257210Z x1 = x1.contiguous() 2025-05-07T20:33:08.5257283Z 2025-05-07T20:33:08.5257382Z if scale_ub is not None: 2025-05-07T20:33:08.5257492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5257638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5257719Z ) 2025-05-07T20:33:08.5257800Z else: 2025-05-07T20:33:08.5257903Z scale_ub_tensor = None 2025-05-07T20:33:08.5257977Z 2025-05-07T20:33:08.5258109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5258207Z op = silu_mul_quant 2025-05-07T20:33:08.5258294Z if compiled: 2025-05-07T20:33:08.5258396Z op = torch.compile(op) 2025-05-07T20:33:08.5258566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5258660Z 2025-05-07T20:33:08.5258773Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5258785Z 2025-05-07T20:33:08.5258891Z moe/activation_test.py:117: 2025-05-07T20:33:08.5259021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5259131Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5259235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5259601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5259744Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5260236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5260342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5260701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5260934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5261386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5261482Z kernel = self.compile( 2025-05-07T20:33:08.5261901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5262092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5262222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5262227Z 2025-05-07T20:33:08.5262439Z self = 2025-05-07T20:33:08.5263225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5263775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b160>} 2025-05-07T20:33:08.5264538Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5264733Z context = 2025-05-07T20:33:08.5264737Z 2025-05-07T20:33:08.5264913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5265182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5265292Z module_map=module_map) 2025-05-07T20:33:08.5265459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5265564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5265649Z E ^ 2025-05-07T20:33:08.5266001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5266006Z 2025-05-07T20:33:08.5266419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5266424Z 2025-05-07T20:33:08.5266533Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5266758Z self=, 2025-05-07T20:33:08.5266843Z T=16384, 2025-05-07T20:33:08.5266922Z D=5120, 2025-05-07T20:33:08.5272173Z scale_ub=1200.0, 2025-05-07T20:33:08.5272295Z contiguous=False, 2025-05-07T20:33:08.5272386Z compiled=False, 2025-05-07T20:33:08.5272475Z ) 2025-05-07T20:33:08.5272706Z self = 2025-05-07T20:33:08.5272994Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5273003Z 2025-05-07T20:33:08.5273087Z @given( 2025-05-07T20:33:08.5273213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5273327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5273451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5273577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5273703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5273826Z ) 2025-05-07T20:33:08.5274078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5274185Z def test_silu_mul_quant( 2025-05-07T20:33:08.5274266Z self, 2025-05-07T20:33:08.5274356Z T: int, 2025-05-07T20:33:08.5274436Z D: int, 2025-05-07T20:33:08.5274539Z scale_ub: Optional[float], 2025-05-07T20:33:08.5274641Z contiguous: bool, 2025-05-07T20:33:08.5274740Z compiled: bool, 2025-05-07T20:33:08.5274822Z ) -> None: 2025-05-07T20:33:08.5274931Z torch.manual_seed(2025) 2025-05-07T20:33:08.5275010Z 2025-05-07T20:33:08.5275183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5275269Z 2025-05-07T20:33:08.5275366Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5275536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5275640Z x = x_sign * x_clamp 2025-05-07T20:33:08.5275729Z x0 = x[:, :D] 2025-05-07T20:33:08.5275813Z x1 = x[:, D:] 2025-05-07T20:33:08.5275897Z 2025-05-07T20:33:08.5275985Z if contiguous: 2025-05-07T20:33:08.5276087Z x0 = x0.contiguous() 2025-05-07T20:33:08.5276180Z x1 = x1.contiguous() 2025-05-07T20:33:08.5276257Z 2025-05-07T20:33:08.5276362Z if scale_ub is not None: 2025-05-07T20:33:08.5276471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5276619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5276748Z ) 2025-05-07T20:33:08.5276829Z else: 2025-05-07T20:33:08.5276927Z scale_ub_tensor = None 2025-05-07T20:33:08.5277013Z 2025-05-07T20:33:08.5277150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5277246Z op = silu_mul_quant 2025-05-07T20:33:08.5277342Z if compiled: 2025-05-07T20:33:08.5277447Z op = torch.compile(op) 2025-05-07T20:33:08.5277566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5277644Z 2025-05-07T20:33:08.5277739Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5277744Z 2025-05-07T20:33:08.5277853Z moe/activation_test.py:117: 2025-05-07T20:33:08.5277984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5278087Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5278197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5278713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.5278819Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:08.5279178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:08.5279410Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.5279765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.5279866Z     kernel = self.compile(
2025-05-07T20:33:08.5280244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.5280429Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.5280558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.5280823Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:08.5281614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.5282133Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f158e92b940>}
2025-05-07T20:33:08.5282924Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:08.5283120Z context = <...>
2025-05-07T20:33:08.5283306Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.5283574Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.5283687Z                            module_map=module_map)
2025-05-07T20:33:08.5283860Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.5283998Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.5284087Z E   ^
2025-05-07T20:33:08.5284443Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.5284869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
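[Editor's note] Every example above fails at the same point: Triton refuses to lower the fp8e4nv element type (the float8 e4m3 variant that silu_mul_quant quantizes into) while building _fbgemm_silu_mul_quant. Triton's CUDA backend compiles fp8e4nv only for devices of compute capability 8.9 and newer (Ada and Hopper); on older parts such as the A100 (SM 8.0) or A10G (SM 8.6) only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper and the test-class name are illustrative, not code from activation_test.py:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton's CUDA backend lowers
    # it only on compute capability (8, 9) or newer. On anything older the
    # kernel build raises the ValueError captured in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
class Fp8ActivationTests(unittest.TestCase):  # illustrative class name
    ...

With a guard like this (or an equivalent pytest.mark.skipif), the job would record skips on pre-Ada runners instead of re-failing every generated example.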
[The remaining Hypothesis retries are elided: each one re-printed the identical test source and failed with the identical CompilationError at triton/compiler/compiler.py:100. The examples tried were:
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True]
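[Editor's note] For debugging outside the Hypothesis harness, a standalone call reproduces the failure. This is a sketch that assumes only that silu_mul_quant is importable from the path shown in the tracebacks; T=128 and D=5120 are arbitrary values from the test's sample space:

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# One bfloat16 activation tensor, split into the two halves x0 and x1 as in the test.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# On SM < 8.9 GPUs this raises the triton CompilationError seen above; on
# SM 8.9+ hardware it should instead return the fp8 output and its scale.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)

The final recorded example and its identical traceback close this excerpt.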
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5427313Z 2025-05-07T20:33:08.5427740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5427745Z 2025-05-07T20:33:08.5427849Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5428072Z self=, 2025-05-07T20:33:08.5428162Z T=128, 2025-05-07T20:33:08.5428241Z D=7168, 2025-05-07T20:33:08.5428336Z scale_ub=1200.0, 2025-05-07T20:33:08.5428424Z contiguous=False, 2025-05-07T20:33:08.5428518Z compiled=True, 2025-05-07T20:33:08.5428606Z ) 2025-05-07T20:33:08.5428863Z self = 2025-05-07T20:33:08.5429043Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5429047Z 2025-05-07T20:33:08.5429136Z @given( 2025-05-07T20:33:08.5429257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5429361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5429491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5429612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5429735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5429813Z ) 2025-05-07T20:33:08.5430057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5430160Z def test_silu_mul_quant( 2025-05-07T20:33:08.5430238Z self, 2025-05-07T20:33:08.5430362Z T: int, 2025-05-07T20:33:08.5430453Z D: int, 2025-05-07T20:33:08.5430552Z scale_ub: Optional[float], 2025-05-07T20:33:08.5430642Z contiguous: bool, 2025-05-07T20:33:08.5430738Z compiled: bool, 2025-05-07T20:33:08.5430821Z ) -> None: 2025-05-07T20:33:08.5430919Z torch.manual_seed(2025) 2025-05-07T20:33:08.5431006Z 2025-05-07T20:33:08.5431176Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5431299Z 2025-05-07T20:33:08.5431395Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5431523Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5431622Z x = x_sign * x_clamp 2025-05-07T20:33:08.5431706Z x0 = x[:, :D] 2025-05-07T20:33:08.5431787Z x1 = x[:, D:] 2025-05-07T20:33:08.5431869Z 2025-05-07T20:33:08.5431954Z if contiguous: 2025-05-07T20:33:08.5432047Z x0 = x0.contiguous() 2025-05-07T20:33:08.5432146Z x1 = x1.contiguous() 2025-05-07T20:33:08.5432228Z 2025-05-07T20:33:08.5432319Z if scale_ub is not None: 2025-05-07T20:33:08.5432435Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5432573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5432658Z ) 2025-05-07T20:33:08.5432736Z else: 2025-05-07T20:33:08.5432873Z scale_ub_tensor = None 2025-05-07T20:33:08.5432959Z 2025-05-07T20:33:08.5433094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5433189Z op = silu_mul_quant 2025-05-07T20:33:08.5433283Z if compiled: 2025-05-07T20:33:08.5433387Z op = torch.compile(op) 2025-05-07T20:33:08.5433498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5433579Z 2025-05-07T20:33:08.5433672Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5433676Z 2025-05-07T20:33:08.5433775Z moe/activation_test.py:117: 2025-05-07T20:33:08.5433918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5434067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5434176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5434549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5434651Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5435162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5435266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5435625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5435862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5436199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5436306Z kernel = self.compile( 2025-05-07T20:33:08.5436693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5436870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5437011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5437016Z 2025-05-07T20:33:08.5437220Z self = 2025-05-07T20:33:08.5438016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5438518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e3bf940>} 2025-05-07T20:33:08.5439313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5439515Z context = 2025-05-07T20:33:08.5439519Z 2025-05-07T20:33:08.5439703Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5439966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5440675Z module_map=module_map) 2025-05-07T20:33:08.5440878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5440978Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5441065Z E ^ 2025-05-07T20:33:08.5441420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5441431Z 2025-05-07T20:33:08.5441860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5441865Z 2025-05-07T20:33:08.5441972Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5442375Z self=, 2025-05-07T20:33:08.5442466Z T=2048, 2025-05-07T20:33:08.5442544Z D=7168, 2025-05-07T20:33:08.5442634Z scale_ub=None, 2025-05-07T20:33:08.5442723Z contiguous=True, 2025-05-07T20:33:08.5442810Z compiled=True, 2025-05-07T20:33:08.5442890Z ) 2025-05-07T20:33:08.5443108Z self = 2025-05-07T20:33:08.5443286Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.5443290Z 2025-05-07T20:33:08.5443374Z @given( 2025-05-07T20:33:08.5443495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5443601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5443797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5443914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5444036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5444112Z ) 2025-05-07T20:33:08.5444361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5444462Z def test_silu_mul_quant( 2025-05-07T20:33:08.5444541Z self, 2025-05-07T20:33:08.5444621Z T: int, 2025-05-07T20:33:08.5444706Z D: int, 2025-05-07T20:33:08.5444806Z scale_ub: Optional[float], 2025-05-07T20:33:08.5444895Z contiguous: bool, 2025-05-07T20:33:08.5444994Z compiled: bool, 2025-05-07T20:33:08.5445073Z ) -> None: 2025-05-07T20:33:08.5445169Z torch.manual_seed(2025) 2025-05-07T20:33:08.5445251Z 2025-05-07T20:33:08.5445421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5445508Z 2025-05-07T20:33:08.5445600Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5445726Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5445825Z x = x_sign * x_clamp 2025-05-07T20:33:08.5445906Z x0 = x[:, :D] 2025-05-07T20:33:08.5445987Z x1 = x[:, D:] 2025-05-07T20:33:08.5446069Z 2025-05-07T20:33:08.5446155Z if contiguous: 2025-05-07T20:33:08.5446246Z x0 = x0.contiguous() 2025-05-07T20:33:08.5446345Z x1 = x1.contiguous() 2025-05-07T20:33:08.5446418Z 2025-05-07T20:33:08.5446509Z if scale_ub is not None: 2025-05-07T20:33:08.5446621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5446759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5446836Z ) 2025-05-07T20:33:08.5446921Z else: 2025-05-07T20:33:08.5447015Z scale_ub_tensor = None 2025-05-07T20:33:08.5447096Z 2025-05-07T20:33:08.5447307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5447401Z op = silu_mul_quant 2025-05-07T20:33:08.5447493Z if compiled: 2025-05-07T20:33:08.5447595Z op = torch.compile(op) 2025-05-07T20:33:08.5447702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5447781Z 2025-05-07T20:33:08.5447875Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5447880Z 2025-05-07T20:33:08.5447977Z moe/activation_test.py:117: 2025-05-07T20:33:08.5448175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5448276Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5448384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5448749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5448843Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5449344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5449443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5449800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5450084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5450429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5450533Z kernel = self.compile( 2025-05-07T20:33:08.5450910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5451085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5451220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5451224Z 2025-05-07T20:33:08.5451433Z self = 2025-05-07T20:33:08.5452268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5452767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e383550>} 2025-05-07T20:33:08.5453522Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5453720Z context = 2025-05-07T20:33:08.5453724Z 2025-05-07T20:33:08.5453890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5454162Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5454272Z module_map=module_map) 2025-05-07T20:33:08.5454434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5454539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5454620Z E ^ 2025-05-07T20:33:08.5454971Z E ValueError("type fp8e4nv not supported in this architecture. 
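Every example that reaches the Triton kernel fails with the same CompilationError: fp8e4nv (the e4m3 FP8 format) is not implemented for this GPU, which offers only fp8e4b15 and fp8e5. That is consistent with an A10G-class device reporting CUDA compute capability (8, 6), while Triton's fp8e4nv conversions need capability (8, 9) or newer (Ada/Hopper). A minimal capability gate for a unittest-style suite like the one above could look as follows; the helper and class names are illustrative, not FBGEMM's actual gating:

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # Triton only lowers fp8e4nv on SM 8.9+; an SM 8.6 part is exactly
        # what the ValueError above complains about.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not fp8e4nv_supported(),
        "fp8e4nv unsupported on this architecture (only fp8e4b15/fp8e5)",
    )
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant and friends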
2025-05-07T20:33:08.5455505Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5461359Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5466952Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5472237Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5477895Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:94: OutOfMemoryError
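The OutOfMemoryError reports above all carry the allocator's own hint: a slice of memory (19-141 MiB across these examples) is reserved by PyTorch but unallocated, i.e. fragmented. The hinted remedy, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is in place before the process first touches CUDA, so on a CI runner it belongs in the job environment rather than the test body. A sketch of the in-process equivalent, assuming it runs ahead of any CUDA allocation:

    import os

    # The allocator config is read when the CUDA caching allocator starts
    # up; setting it later in the process has no effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # the first CUDA allocation now sees the config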
2025-05-07T20:33:08.5483407Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5496245Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5509324Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]
2025-05-07T20:33:08.5522126Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5532432Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5546498Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:33:08.5552060Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5557522Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5562806Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5568123Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5573492Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5578526Z 2025-05-07T20:33:08.5578657Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5578663Z 2025-05-07T20:33:08.5578786Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5579037Z self=, 2025-05-07T20:33:08.5579120Z T=4096, 2025-05-07T20:33:08.5579196Z D=7168, 2025-05-07T20:33:08.5579280Z scale_ub=1200.0, 2025-05-07T20:33:08.5579418Z contiguous=True, 2025-05-07T20:33:08.5579507Z compiled=False, 2025-05-07T20:33:08.5579588Z ) 2025-05-07T20:33:08.5579804Z self = 2025-05-07T20:33:08.5579974Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.5579978Z 2025-05-07T20:33:08.5580064Z @given( 2025-05-07T20:33:08.5580180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5580284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5580405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5580522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5580640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5580716Z ) 2025-05-07T20:33:08.5581066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5581173Z def test_silu_mul_quant( 2025-05-07T20:33:08.5581255Z self, 2025-05-07T20:33:08.5581333Z T: int, 2025-05-07T20:33:08.5581417Z D: int, 2025-05-07T20:33:08.5581513Z scale_ub: Optional[float], 2025-05-07T20:33:08.5581601Z contiguous: bool, 2025-05-07T20:33:08.5581692Z compiled: bool, 2025-05-07T20:33:08.5581774Z ) -> None: 2025-05-07T20:33:08.5581870Z torch.manual_seed(2025) 2025-05-07T20:33:08.5581951Z 2025-05-07T20:33:08.5582118Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5583954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5583963Z 2025-05-07T20:33:08.5584080Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5584085Z 2025-05-07T20:33:08.5584192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5584414Z self=, 2025-05-07T20:33:08.5584492Z T=16384, 2025-05-07T20:33:08.5584575Z D=7168, 2025-05-07T20:33:08.5584659Z scale_ub=None, 2025-05-07T20:33:08.5584745Z contiguous=False, 2025-05-07T20:33:08.5584833Z compiled=True, 2025-05-07T20:33:08.5584905Z ) 2025-05-07T20:33:08.5585119Z self = 2025-05-07T20:33:08.5585304Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5585309Z 2025-05-07T20:33:08.5585386Z @given( 2025-05-07T20:33:08.5585509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5585610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5585724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5585847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5585960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5586037Z ) 2025-05-07T20:33:08.5586288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5586424Z def test_silu_mul_quant( 2025-05-07T20:33:08.5586507Z self, 2025-05-07T20:33:08.5586592Z T: int, 2025-05-07T20:33:08.5586670Z D: int, 2025-05-07T20:33:08.5586768Z scale_ub: Optional[float], 2025-05-07T20:33:08.5586866Z contiguous: bool, 2025-05-07T20:33:08.5586953Z compiled: bool, 2025-05-07T20:33:08.5587043Z ) -> None: 2025-05-07T20:33:08.5587140Z torch.manual_seed(2025) 2025-05-07T20:33:08.5587214Z 2025-05-07T20:33:08.5587388Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5589207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5589215Z 2025-05-07T20:33:08.5589341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5589346Z 2025-05-07T20:33:08.5589451Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5589714Z self=, 2025-05-07T20:33:08.5589802Z T=4096, 2025-05-07T20:33:08.5589884Z D=7168, 2025-05-07T20:33:08.5589967Z scale_ub=None, 2025-05-07T20:33:08.5590060Z contiguous=True, 2025-05-07T20:33:08.5590143Z compiled=False, 2025-05-07T20:33:08.5590224Z ) 2025-05-07T20:33:08.5590441Z self = 2025-05-07T20:33:08.5590615Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5590620Z 2025-05-07T20:33:08.5590704Z @given( 2025-05-07T20:33:08.5590866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5590965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5591087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5591207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5591327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5591408Z ) 2025-05-07T20:33:08.5591658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5591763Z def test_silu_mul_quant( 2025-05-07T20:33:08.5591842Z self, 2025-05-07T20:33:08.5591920Z T: int, 2025-05-07T20:33:08.5592007Z D: int, 2025-05-07T20:33:08.5592104Z scale_ub: Optional[float], 2025-05-07T20:33:08.5592192Z contiguous: bool, 2025-05-07T20:33:08.5592286Z compiled: bool, 2025-05-07T20:33:08.5592364Z ) -> None: 2025-05-07T20:33:08.5592459Z torch.manual_seed(2025) 2025-05-07T20:33:08.5592540Z 2025-05-07T20:33:08.5592713Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5594501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5594510Z 2025-05-07T20:33:08.5594630Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5594634Z 2025-05-07T20:33:08.5594740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5594961Z self=, 2025-05-07T20:33:08.5595091Z T=16384, 2025-05-07T20:33:08.5595176Z D=7168, 2025-05-07T20:33:08.5595258Z scale_ub=None, 2025-05-07T20:33:08.5595343Z contiguous=True, 2025-05-07T20:33:08.5595431Z compiled=False, 2025-05-07T20:33:08.5595505Z ) 2025-05-07T20:33:08.5595727Z self = 2025-05-07T20:33:08.5595911Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5595916Z 2025-05-07T20:33:08.5596034Z @given( 2025-05-07T20:33:08.5596151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5596258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5596372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5596488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5596610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5596685Z ) 2025-05-07T20:33:08.5596941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5597039Z def test_silu_mul_quant( 2025-05-07T20:33:08.5597117Z self, 2025-05-07T20:33:08.5597199Z T: int, 2025-05-07T20:33:08.5597278Z D: int, 2025-05-07T20:33:08.5597377Z scale_ub: Optional[float], 2025-05-07T20:33:08.5597475Z contiguous: bool, 2025-05-07T20:33:08.5597603Z compiled: bool, 2025-05-07T20:33:08.5597687Z ) -> None: 2025-05-07T20:33:08.5597789Z torch.manual_seed(2025) 2025-05-07T20:33:08.5597868Z 2025-05-07T20:33:08.5598037Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5599820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5599866Z 2025-05-07T20:33:08.5599987Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5599998Z 2025-05-07T20:33:08.5600103Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5600323Z self=, 2025-05-07T20:33:08.5600410Z T=16384, 2025-05-07T20:33:08.5600488Z D=7168, 2025-05-07T20:33:08.5600571Z scale_ub=1200.0, 2025-05-07T20:33:08.5600664Z contiguous=True, 2025-05-07T20:33:08.5600749Z compiled=False, 2025-05-07T20:33:08.5600822Z ) 2025-05-07T20:33:08.5601044Z self = 2025-05-07T20:33:08.5601221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.5601228Z 2025-05-07T20:33:08.5601316Z @given( 2025-05-07T20:33:08.5601433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5601531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5601651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5601770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5601883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5601963Z ) 2025-05-07T20:33:08.5602210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5602304Z def test_silu_mul_quant( 2025-05-07T20:33:08.5602386Z self, 2025-05-07T20:33:08.5602463Z T: int, 2025-05-07T20:33:08.5602539Z D: int, 2025-05-07T20:33:08.5602642Z scale_ub: Optional[float], 2025-05-07T20:33:08.5602730Z contiguous: bool, 2025-05-07T20:33:08.5602822Z compiled: bool, 2025-05-07T20:33:08.5602899Z ) -> None: 2025-05-07T20:33:08.5603061Z torch.manual_seed(2025) 2025-05-07T20:33:08.5603142Z 2025-05-07T20:33:08.5603310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5605051Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
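The attempted allocation sizes above are exactly the bfloat16 input tensor of shape [T, 2 * D]. A minimal sketch that checks the arithmetic for the largest example and applies the allocator hint from the error message; it assumes PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is first used, so it must be set before torch touches the GPU (e.g. at the top of the test process):

    import os

    # Allocator hint suggested by the OOM message; must be set before the
    # first CUDA allocation in the process for it to take effect.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    # Sanity-check the 448.00 MiB figure reported for the largest example:
    T, D = 16384, 7168
    bytes_needed = T * (2 * D) * 2          # bfloat16 = 2 bytes per element
    print(bytes_needed / (1024 ** 2))       # -> 448.0, matching the log

The 80/40/112/56 MiB figures for the other (T, D) pairs fall out of the same formula.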
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f158e041ca0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
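The CompilationError above is Triton rejecting the fp8e4nv element type on this runner's GPU: the g5.4xlarge instance carries an NVIDIA A10G (compute capability 8.6), while fp8e4nv (the e4m3 variant behind torch.float8_e4m3fn) generally needs compute capability 8.9 or newer, hence the fallback list ('fp8e4b15', 'fp8e5'). A hedged sketch of a capability guard; the 8.9 threshold, helper name, and class name are assumptions for illustration, not FBGEMM's actual gating:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv kernels require compute capability >= 8.9
        # (Ada/Hopper). The A10G on this runner reports (8, 6), which is
        # exactly the case that triggers the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8GuardedTests(unittest.TestCase):
        ...

With a guard like this the fp8 examples would be skipped on pre-8.9 GPUs instead of failing the job.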
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

This example allocated successfully and reached `y_fp8, y_scale = fn()` (moe/activation_test.py:117), then failed with the same Triton CompilationError as above. Because compiled=True routes the call through torch.compile, the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) and Triton's compiler:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

This example got past the initial allocation but hit CUDA OOM on the next temporary:

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:33:08.5656427Z 2025-05-07T20:33:08.5656643Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:08.5656810Z ================= 1 failed, 1 deselected, 3 warnings in 19.48s ================= 2025-05-07T20:33:10.1276882Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:10.1905725Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:10.1905972Z 2025-05-07T20:33:10.1906580Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:10.1907164Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:10.1907562Z 2025-05-07T20:33:10.1907569Z 2025-05-07T20:33:10.1907742Z 2025-05-07T20:33:10.1925326Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:10.2007169Z Post job cleanup. 2025-05-07T20:33:10.2986778Z [command]/usr/bin/git version 2025-05-07T20:33:10.3028577Z git version 2.47.1 2025-05-07T20:33:10.3064054Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/2163174b-65c6-42d6-aaf5-a8a6664bfa26/.gitconfig' 2025-05-07T20:33:10.3074567Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2163174b-65c6-42d6-aaf5-a8a6664bfa26' before making global git config changes 2025-05-07T20:33:10.3075445Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:10.3079829Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:10.3119560Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:10.3154796Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:10.3489775Z Entering 'external/asmjit' 2025-05-07T20:33:10.3556430Z Entering 'external/composable_kernel' 2025-05-07T20:33:10.3629043Z Entering 'external/cpuinfo' 2025-05-07T20:33:10.3697219Z Entering 'external/cutlass' 2025-05-07T20:33:10.3773338Z Entering 'external/googletest' 2025-05-07T20:33:10.3838447Z Entering 'external/hipify_torch' 2025-05-07T20:33:10.3903700Z Entering 'external/json' 2025-05-07T20:33:10.3990579Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:10.4015550Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4027709Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:10.4059068Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:10.4386136Z Entering 'external/asmjit' 2025-05-07T20:33:10.4431292Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4473522Z Entering 'external/composable_kernel' 2025-05-07T20:33:10.4516123Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4565129Z Entering 'external/cpuinfo' 2025-05-07T20:33:10.4607921Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4650203Z Entering 'external/cutlass' 2025-05-07T20:33:10.4693063Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4744840Z 
Entering 'external/googletest' 2025-05-07T20:33:10.4788039Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4830350Z Entering 'external/hipify_torch' 2025-05-07T20:33:10.4873415Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4915296Z Entering 'external/json' 2025-05-07T20:33:10.4961234Z http.https://github.com/.extraheader 2025-05-07T20:33:10.5118527Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:10.5153691Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:10.5163902Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:10.5164261Z ##[endgroup] 2025-05-07T20:33:10.5267936Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:21.2950980Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:37.8098561Z Cleaning up orphan processes
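One pattern worth noting across the failed run above: free GPU memory shrank from 26.44 MiB to 4.44 MiB and PyTorch-allocated memory grew from 21.73 GiB to 21.77 GiB as Hypothesis moved through examples, so state left over from earlier examples (or from earlier tests in the suite) was still resident when later examples started. A sketch, not FBGEMM's code, of releasing cached CUDA memory between examples; the helper name is illustrative:

    import gc
    import torch

    def reset_cuda_memory() -> None:
        gc.collect()                  # drop Python references to dead tensors
        torch.cuda.empty_cache()      # return cached allocator blocks to the driver
        torch.cuda.synchronize()      # ensure pending frees have taken effect

    # e.g. call reset_cuda_memory() at the top of test_silu_mul_quant, or from a
    # unittest setUp(), before allocating the [T, 2 * D] bfloat16 input.

This does not fix fragmentation by itself, but it keeps one OOM-ing example from starving the next one.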